Introduction

This tutorial introduces regression modeling using R. The R-markdown document for the tutorial can be downloaded here.

Regression models are among the most widely used methods in data analysis because they are/can

  • incorporate many predictors in a single model (multivariate: allows to test the impact of one predictor while the impact of (all) other predictors is controlled for)
  • extremely flexible and and can be fitted to different types of predictors and dependent variables
  • provide output that can be easily interpreted
  • conceptually relative simple and not overly complex from a mathematical perspective

There are two basic types f regression models: fixed-effects regression models and mixed-effects regression models. The first part of this tutorial focuses on fixed-effects regression models while the second part focuses on mixed-effects regression models (the difference lies in the fact that fixed-effects regression models allow us to model hierarchical or nested data - more on that in the second part of this tutorial).

Fixed-effects regression models are simple additive models which means that the predicted values represent the intercept value plus the effects of the individual predictors while mixed-effects models are based on more complex matrix multiplications where predicted values represent the product of the random effect multiplied by the intercept values plus the effects of the fixed effects. R offers a various ready-made functions with which implementing different types of regression models is very easy.

In the following, we will go over the most relevant and frequently used types of regression models:

  • simple linear regression

  • multiple linear regression

  • multiple binomial logistic regression

  • ordinal regression

  • Poisson regression

  • robust regression

The major difference between these types of models is that they take different types of dependent variables: linear regressions take numeric , logistic regressions take nominal variables, ordinal regressions take ordinal variables, and Poisson regressions take dependent variables that reflect counts of (rare) events. Robust regression, in contrast, is a simple multiple linear regression that is able to handle outliers due to a weighing procedure.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages(c("boot", "car", "caret", "dplyr",  "effects", "foreign", "ggplot2", 
                   "Hmisc", "DT", "knitr", "lme4", "MASS", "mlogit", "msm", 
                   "QuantPsyc", "reshape2", "rms", "sandwich", "sfsmisc", "sjPlot", 
                   "stringr", "vcd", "visreg", "MuMIn"))

Once you have installed R-Studio and initiated the session by executing the code shown above, you are good to go.

1 Fixed Effects Regression

Before turning to mixed-effects models which are able to represent hierarchical data structures, we will focus on traditional fixed effects regression models and begin with multiple linear regression.

1.1 Simple Linear Regression

This section focuses on a very widely used statistical method which is called regression. Regressions are used when we try to understand how independent variables correlate with a dependent or outcome variable. So, if you want to investigate how a certain factor affects an outcome, then a regression is the way to go. We will have a look at two simple examples to understand what the concepts underlying a regression mean and how a regression works. The R-Code, that we will use, is adapted from Field, Miles, and Field (2012) - which is highly recommended for understanding regression analyses! In addition to Field, Miles, and Field (2012), there are various introductions which also focus on regression (among other types of analyses), for example, Gries (2009), Levshina (2015), Wilcox (2009) - Baayen (2008) is also very good but probably not the first book one should read about statistics.

Although the basic logic underlying regressions is identical to the conceptual underpinnings of analysis of variance (ANOVA), a related method, sociolinguistists have traditionally favoured regression analysis in their studies while ANOVAs have been the method of choice in psycholinguistics. The preference for either method is grounded in historical happenstances and the culture of these subdisciplines rather than in methodological reasoning.

A minor difference between regressions and ANOVA lies in the fact that regressions are based on the \(t\)-distribution while ANOVAs use the \(F\)-distribution (however, the \(F\)-value is simply the value of \(t\) squared or t2). Both \(t\)- and \(F\)-values report on the ratio between explained and unexplained variance.

The idea behind regression analysis is expressed formally in the equation below where\(f_{(x)}\) is the \(y\)-value we want to predict, \(\alpha\) is the intercept (the point where the regression line crosses the \(y\)-axis), \(\beta\) is the coefficient (the slope of the regression line).

\(f_{(x)} = \alpha + \beta_{1}x_{i} + \epsilon\)

In other words, to estimate how much some weights who is 180cm tall, we would multiply the coefficent (slope of the line) with 180 (\(x\)) and add the value of the intercept (point where line crosses the \(y\)-axis).

However, the idea behind regressions can best be described graphically: imagine a cloud of points (like the points in the scatterplot below). Regressions aim to find that line which has the minimal summed distance between points and the line (like the line in the right panel). Technically speaking, the aim of a regression is to find the line with the minimal deviance (or the line with the minimal sum of residuals). Residuals are the distance between the line and the points (the red lines) and it is also called variance.

Thus, regression lines are those lines where the sum of the red lines should be minimal. The slope of the regression line is called coefficient and the point where the regression line crosses the y-axis is called the intercept.

A word about standard errors (SE) is in order here because most commonly used statistics programs will provide SE values when reporting regression models. The SE is a measure that tells us how much the coefficients were to vary if the same regression were applied to many samples from the same population. A relatively small SE value therefore indicates that the coefficients will remain very stable if the same regression model is fitted to many different samples with identical parameters. In contrast, a large SE tells you that the model is volatile and not very stable or reliable as the coefficients vary substantially if the model is applied to many samples.

Example 1: Preposition Use across Real-Time

We will now turn to our first example. In this example, we will investigate whether the frequency of prepositions has changed from Middle English to Late Modern English. The reasoning behind this example is that Old English was highly synthetic compared with Present-Day English which comparatively analytic. In other words, while Old English speakers used case to indicate syntactic relations, speakers of Present-Day English use word order and prepositions to indicate syntactic relationships. This means that the loss of case had to be compensated by different strategies and maybe these strategies continued to develop and increase in frequency even after the change from synthetic to analytic had been mostly accomplished. And this prolonged change in compensatory strategies is what this example will focus on.

The analysis is based on data extracted from the Penn Corpora of Historical English (see http://www.ling.upenn.edu/hist-corpora/), that consists of 603 texts written between 1125 and 1900. In preparation of this example, all elements that were part-of-speech tagged as prepositions were extracted from the PennCorpora.

Then, the relative frequencies (per 1,000 words) of prepositions per text were calculated. This frequency of prepositions per 1,000 words represents our dependent variable. In a next step, the date when each letter had been written was extracted. The resulting two vectors were combined into a table which thus contained for each text, when it was written (independent variable) and its relative frequency of prepositions (dependent or outcome variable).

A regression analysis will follow the steps described below: 1. Extraction and processing of the data 2. Data visualization 3. Applying the regression analysis to the data 4. Diagnosing the regression model and checking whether or not basic model assumptions have been violated.

In a first step, we load the libraries and functions.

# load libraries
library(car)
library(dplyr)
library(ggplot2)
library(knitr)         
library(QuantPsyc)  
library(DT)
# load functions
source("https://slcladal.github.io/rscripts/multiplot.r")
source("https://slcladal.github.io/rscripts/slrsummary.r")

After preparing our session, we can now load and inspect the data to get a first impression of its properties.

# load data
slrdata <- read.delim("https://slcladal.github.io/data/lmmdata.txt", header = TRUE)
# inspect data
head(slrdata)                                   
##   Date         Genre    Text Prepositions Region
## 1 1736       Science   albin       166.01  North
## 2 1711     Education    anon       139.86  North
## 3 1808 PrivateLetter  austen       130.78  North
## 4 1878     Education    bain       151.29  North
## 5 1743     Education barclay       145.72  North
## 6 1908     Education  benson       120.77  North

Inspecting the data is very important because it can happen that a data set may not load completely or that variables which should be numeric have been converted to character variables. If unchecked, then such issues could go unnoticed and cause trouble.

We will now plot the data to get a better understanding of what the data looks like.

ggplot(slrdata, aes(Date, Prepositions)) +
  geom_point() +
  theme_bw() +
  labs(x = "Year") +
  labs(y = "Prepositions per 1,000 words") +
  geom_smooth()

ggplot(slrdata, aes(Date, Prepositions)) +
  geom_point() +
  theme_bw() +
  labs(x = "Year") +
  labs(y = "Prepositions per 1,000 words") +
  geom_smooth(method = "lm") # with linear model smoothing!

Before beginning with the regression analysis, we will scale the year. We scale by subtracting each value from the mean of year. This can be useful when dealing with numeric variables because if we did not scale year, we would get estimated values for year 0 (a year when English did not even exist yet). If a variable is scaled, the regression provides estimates of the model refer to the mean of that numeric variable. In other words, scaling can eb very helpful, especially with respect to the interpretation of the results that regression models report.

# scale date
slrdata$Date <- slrdata$Date - mean(slrdata$Date) 

We will now begin the regression analysis by generating a first regression model. and inspect its results.

# create initial model
m1.lm <- lm(Prepositions ~ Date, data = slrdata)
# inspect results
summary(m1.lm)
## 
## Call:
## lm(formula = Prepositions ~ Date, data = slrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.101 -13.855   0.578  13.321  62.858 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 132.190093   0.838637 157.625 <0.0000000000000002 ***
## Date          0.017322   0.007267   2.383              0.0175 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.43 on 535 degrees of freedom
## Multiple R-squared:  0.01051,    Adjusted R-squared:  0.008657 
## F-statistic: 5.681 on 1 and 535 DF,  p-value: 0.0175

The summary output starts by repeating the regression equation. Then, the model provides the distribution of the residuals. The residuals should be sitributed normally with the Min and Max as well as the 1Q (first quartile) and 3Q (third quartile) being similar or ideally identical. In our case, the values are very similar which suggests that the residuals are distributed evenly and follow a normal distribution. The next part of the repost is the coefficients table. The Estimate for the intercept is the value where the regression line crosses the y-axis. The estimate for Date represents the slope of the regression line and tells us that with each year, the predicted frequency of prepositions incerase by .01732 prepositions. The t-value is the Estimate divided by the standard error (Std. Error). Based on the t-value, the p-value can be calculated manually as shown below.

# use pt function (which uses t-values and the degrees of freedom)
2*pt(-2.383, nrow(slrdata)-1)
## [1] 0.01751964

The R2-values tell us how much variance is explained by our model compared to the overall variance (0.0105 means that our model explains only 1.05 percent of the variance - which is a tiny amount). The adjusted R2-value tell us how much variance the model would explain if we applied the model to new data of the same nature (which data points taken from the same population). Or, to be more precise, the adjusted R2 takes the number of predictors into account: if a model contains predictors that do not explain much variance, then the adjusted R2 will be lower than the multiple R2 as the former penalizes models for having superfluous predictors. If there is a big difference between the two R2-values, then the model is overfitted which is not good. The F-statistic and the associated p-value tell us that the model, despite explaining almost no variance, is still significantly better than an intercept-only base-line model (or using the overall mean to predict the frequency of prepositions per text).

We can test this and also see where the F-values comes from by comparing the

# create intercept-only base-line model
m0.lm <- lm(Prepositions ~ 1, data = slrdata)
# compare the base-line and the more saturated model
anova(m1.lm, m0.lm, test = "F")
## Analysis of Variance Table
## 
## Model 1: Prepositions ~ Date
## Model 2: Prepositions ~ 1
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    535 202058                             
## 2    536 204204 -1   -2145.6 5.6809 0.0175 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F- and p-values are exactly those reported by the summary which shows where the F-values comes from and what it means; namely it denote the difference between the base-line and the more saturated model.

The degrees of freedom associated with the residual standard error are the number of cases in the model minus the number of predictors (including the intercept). The residual standard error is square root of the sum of the squared residuals of the model divided by the degrees of freedom. Have a look at he following to clear this up:

# DF = N - number of predictors (including intercept)
DegreesOfFreedom <- nrow(slrdata)-length(coef(m1.lm))
# sum of the squared residuals
SumSquaredResiduals <- sum(resid(m1.lm)^2)
# Residual Standard Error
sqrt(SumSquaredResiduals/DegreesOfFreedom); DegreesOfFreedom
## [1] 19.43396
## [1] 535

We will now check if mathematical assumptions have been violated (homogeneity of variance) or whether the data contains outliers. We check this using diagnostic plots.

# plot model: 3 plots per row in one window
par(mfrow = c(1, 3))
plot(resid(m1.lm))
plot(rstandard(m1.lm))
plot(rstudent(m1.lm)); par(mfrow = c(1, 1)) # restore default parameters

The left graph shows the residuals of the model (i.e., the differences between the observed and the values predicted by the regression model). The problem with this plot is that the residuals are not standardized and so they cannot be compared to the residuals of other models. To remedy this deficiency, residuals are normalized by dividing the residuals by their standard deviation. Then, the normalized residuals can be plotted against the observed values (centre panel). In this way, not only are standardized residuals obtained, but the values of the residuals are transformed into z-values, and one can use the z-distribution to find problematic data points. There are three rules of thumb regarding finding problematic data points through standardized residuals (Field, Miles, and Field 2012, 268–69):

  • Points with values higher than 3.29 should be removed from the data.

  • If more than 1% of the data points have values higher than 2.58, then the error rate of our model is too high.

  • If more than 5% of the data points have values greater than 1.96, then the error rate of our model is too high.

The right panel shows the * studentized residuals* (adjusted predicted values: each data point is divided by the standard error of the residuals). In this way, it is possible to use Student’s t-distribution to diagnose our model.

Adjusted predicted values are residuals of a special kind: the model is calculated without a data point and then used to predict this data point. The difference between the observed data point and its predicted value is then called the adjusted predicted value. In summary, studentized residuals are very useful because they allow us to identify influential data points.

The plots show that there are two potentially problematic data points (the top-most and bottom-most point). These two points are clearly different from the other data points and may therefore be outliers. We will test later if these points need to be removed.

We will now generate more diagnostic plots.

par(mfrow = c(2, 2)) # plot window: 2 plots/row, 2 plots/column
plot(m1.lm); par(mfrow = c(1, 1)) # restore normal plot window

The diagnostic plots are very positive and we will go through why this is so for each panel. The graph in the upper left panel is useful for finding outliers or for determining the correlation between residuals and predicted values: when a trend becomes visible in the line or points (e.g., a rising trend or a zigzag line), then this would indicate that the model would be problematic (in such cases, it can help to remove data points that are too influential (outliers)).

The graphic in the upper right panel indicates whether the residuals are normally distributed (which is desirable) or whether the residuals do not follow a normal distribution. If the points lie on the line, the residuals follow a normal distribution. For example, if the points are not on the line at the top and bottom, it shows that the model does not predict small and large values well and that it therefore does not have a good fit.

The graphic in the lower left panel provides information about homoscedasticity. Homoscedasticity means that the variance of the residuals remains constant and does not correlate with any independent variable. In unproblematic cases, the graphic shows a flat line. If there is a trend in the line, we are dealing with heteroscedasticity, that is, a correlation between independent variables and the residuals, which is very problematic for regressions.

The graph in the lower right panel shows problematic influential data points that disproportionately affect the regression (this would be problematic). If such influential data points are present, they should be either weighted (one could generate a robust rather than a simple linear regression) or they must be removed. The graph displays Cook’s distance, which shows how the regression changes when a model without this data point is calculated. The cook distance thus shows the influence a data point has on the regression as a whole. Data points that have a Cook’s distance value greater than 1 are problematic (Field, Miles, and Field 2012, 269).

The so-called leverage is also a measure that indicates how strongly a data point affects the accuracy of the regression. Leverage values range between 0 (no influence) and 1 (strong influence: suboptimal!). To test whether a specific data point has a high leverage value, we calculate a cut-off point that indicates whether the leverage is too strong or still acceptable. The following two formulas are used for this:

\[\begin{equation} \frac{3(k + 1)}{n} \end{equation}\]

or

\[\begin{equation} \frac{2(k + 1)}{n} \end{equation}\]

We will look more closely at leverage in the context of multiple linear regression and will therefore end the current analysis by summarizing the results of the regression analysis in a table.

# create summary table
slrresults <- slrsummary(m1.lm)  
# show summary table
slrresults
Results of a simple linear regression analysis.
Estimate Pearson’s r Std. Error t value Pr(>|t|) P-value sig.
(Intercept) 132.19 0.84 157.62 0 p < .001***
Date 0.02 0.1 0.01 2.38 0.0175 p < .05*
Model statistics Value
Number of cases in model 537
Residual standard error on 535 DF 19.43
Multiple R-squared 0.0105
Adjusted R-squared 0.0087
F-statistic (1, 535) 5.68
Model p-value 0.0175

Typically, the results of regression analyses are presented in such tables as they include all important measures of model quality and significance, as well as the magnitude of the effects.

In addition, the results of simple linear regressions should be summarized in writing. An example of how the results of a regression analysis can be written up is provided below.

A simple linear regression has been fitted to the data. A visual assessment of the model diagnostic graphics did not indicate any problematic or disproportionately influential data points (outliers) and performed significantly better compared to an intercept-only base line model but only explained .87 percent of the vraiance (Adjusted R2: .0087, F-statistic (1, 535): 5,68, p-value: 0.0175*). The final minimal adequate linear regression model is based on 537 data points and confirms a significant and positive correlation between the year in which the text was written and the relative frequency of prepositions (coefficient estimate: .02, SE: 0.01, t-value: 2.38, p-value: .0175*).

Example 2: Teaching Styles

In the previous example, we dealt with two numeric variables, while the following example deals with a categorical independent variable and a numeric dependent variable. The ability for regressions to handle very different types of variables makes regressions a widely used and robust method of analysis.

In this example, we are dealing with two groups of students that have been randomly assigned to be exposed to different teaching methods. Both groups undergo a language learning test after the lesson with a maximum score of 20 points.

The question that we will try to answer is whether the students in group A have performed significantly better than those in group B which would indicate that the teaching method to which group A was exposed works better than the teaching method to which group B was exposed.

Let’s move on to implementing the regression in “R”. In a first step, we load the data set and inspect its structure.

# load data
slrdata2 <- read.delim("https://slcladal.github.io/data/slrdata2.txt", sep = "\t", header = T)
# inspect data
head(slrdata2)
##   Group Score
## 1     A    15
## 2     A    12
## 3     A    11
## 4     A    18
## 5     A    15
## 6     A    15

Now, we graphically display the data. In this case, a boxplot represents a good way to visualize the data.

# extract means
means <- slrdata2 %>%
  dplyr::group_by(Group) %>%
  dplyr::summarise(Mean = round(mean(Score), 1), SD = round(sd(Score), 1))
## `summarise()` ungrouping output (override with `.groups` argument)
# start plot
ggplot(slrdata2, aes(Group, Score)) + 
  geom_boxplot(fill=c("orange", "darkgray")) +
  geom_text(data = means, aes(label = paste("M = ", Mean, sep = ""), y = 1)) +
  geom_text(data = means, aes(label = paste("SD = ", SD, sep = ""), y = 0)) +
  theme_bw(base_size = 15) +
  labs(x = "Group") +                      
  labs(y = "Test score (Points)", cex = .75) +   
  coord_cartesian(ylim = c(0, 20)) +  
  guides(fill = FALSE)                
Language test data

Language test data

The data indicate that group A did significantly better than group B. We will test this impression by generating the regression model and creating the model and extracting the model summary.

# generate regression model
m2.lm <- lm(Score ~ Group, data = slrdata2) 
# inspect results
summary(m2.lm)                             
## 
## Call:
## lm(formula = Score ~ Group, data = slrdata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.767 -1.933  0.150  2.067  6.233 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  14.9333     0.5346  27.935 < 0.0000000000000002 ***
## GroupB       -3.1667     0.7560  -4.189            0.0000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.928 on 58 degrees of freedom
## Multiple R-squared:  0.2322, Adjusted R-squared:  0.219 
## F-statistic: 17.55 on 1 and 58 DF,  p-value: 0.00009669

The model summary reports that Groups B performed significantly (Pr(>|t|) is smaller than .001 as indicated by the three * after the p-values) comapred with Groups A (the Estmate is negative). We will now create the diagnostic graphics.

par(mfrow = c(1, 3))        # plot window: 1 plot/row, 3 plots/column
plot(resid(m2.lm))     # generate diagnostic plot
plot(rstandard(m2.lm)) # generate diagnostic plot
plot(rstudent(m2.lm)); par(mfrow = c(1, 1))  # restore normal plot window

The graphics do not indicate outliers or other issues, so we can continue with more diagnostic graphics.

par(mfrow = c(2, 2)) # generate a plot window with 2x2 panels
plot(m2.lm); par(mfrow = c(1, 1)) # restore normal plot window

These graphics also show no problems. In this case, the data can be summarized in the next step.

# tabulate results
slrresults2 <- slrsummary(m2.lm)
slrresults2
Results of the regression model.
Estimate Pearson’s r Std. Error t value Pr(>|t|) P-value sig.
(Intercept) 14.93 0.53 27.94 0 p < .001***
GroupB -3.17 0.48 0.76 -4.19 0.0001 p < .001***
Model statistics Value
Number of cases in model 60
Residual standard error on 58 DF 2.93
Multiple R-squared 0.2322
Adjusted R-squared 0.219
F-statistic (1, 58) 17.55
Model p-value 0.0001

The results of this second simple linear regressions can be summarized as follows:

A simple linear regression was fitted to the data. A visual assessment of the model diagnostics did not indicate any problematic or disproportionately influential data points (outliers). The final linear regression model is based on 60 data points, performed significantly better than an intercept-only base line model (F (1, 58): 17.55, p-value <. 001***), and reported that the model explained 21.9 percent of variance which confirmed a good model fit. According to this final model, group A scored significantly better on the language learning test than group B (coefficient: -3.17, SE: 0.48, t-value: -4.19, p-value <. 001 ***).

1.2 Multiple Linear Regression

In contrast to simple linear regression, which estimates the effect of a single predictor, multiple linear regression estimates the effect of various predictor (see the equation below). A multiple linear regression can thus test the effects of various predictors simultaneously.

\[\begin{equation} f_{(x)} = \alpha + \beta_{1}x_{i} + \beta_{2}x_{i+1} + \dots + \beta_{n}x_{i+n} + \epsilon \end{equation}\]

There exists a wealth of literature focusing on multiple linear regressions and the concepts it is based on. For instance, there are Achen (1982), Bortz (2006), Crawley (2005), Faraway (2002), Field, Miles, and Field (2012) (my personal favourite), Gries (2009), Levshina (2015), and Wilcox (2009) to name just a few. Introductions to regression modeling in R are Baayen (2008), Crawley (2012), Gries (2009), or Levshina (2015).

The model diagnostics we are dealing with here are partly identical to the diagnostic methods discussed in the section on simple linear regression. Because of this overlap, diagnostics will only be described in more detail if they have not been described in the section on simple linear regression.

A brief note on minimum necessary sample or data set size appears necessary here. Although there appears to be a general assumption that 25 data points per group are sufficient, this is not necessarily correct (it is merely a general rule of thumb that is actually often incorrect). Such rules of thumb are inadequate because the required sample size depends on the number of variables in a given model, the size of the effect and the variance of the effect. If a model contains many variables, then this requires a larger sample size than a model which only uses very few predictors. Also, to detect an effect with a very minor effect size, one needs a substantially larger sample compared to cases where the effect is very strong. In fact, when dealing with small effects, model require a minimum of 600 cases to reliably detect these effects. Finally, effects that very robust and do not vary much require a much smaller sample size compared with effects that are spurious and vary substantially. Since the sample size depends on the effect size and variance as well as the number of variables, there is no final one-size-fits-all answer to what the best sample size is.

Another, slightly better but still incorrect, rule of thumb is that the more data, the better. This is not correct because models based on too many cases are prone for overfitting and thus report correlations as being significant that are not. However, given that there are procedures that can correct for overfitting, larger data sets are still preferable to data sets that are simply too small to warrant reliable results. In conclusion, it remains true that the sample size depends on the effect under investigation.

Despite there being no ultimate rule of thumb, Field, Miles, and Field (2012) 273-275), based on Green (1991), provide data-driven suggestions for the minimal size of data required for regression models that aim to find medium sized effects (k = number of predictors; categorical variables with more than two levels should be transformed into dummy variables):

  • If one is merely interested in the overall model fit (something I have not encountered), then the sample size should be at least 50 + k (k = number of predictors in model).
  • If one is only interested in the effect of specific variables, then the sample size should be at least 104 + k (k = number of predictors in model).
  • If one is only interested in both model fit and the effect of specific variables, then the sample size should be at least the higher value of 50 + k or 104 + k (k = number of predictors in model).

You will see in the “R”-code below that there is already a function that tests whether the sample size is sufficient.

Example: Gifts and Availability

The example we will go through here is taken from Field, Miles, and Field (2012). In this example, the research question is if the money that men spend on presents for women depends on the women’s attractiveness and their relationship status. To answer this research question, we will implement a multiple linear regression and start by preparing the R-session (activating necessary packages, and loading functions).

# load libraries
library(boot)
library(Boruta)
library(car)
library(caret)
library(DT)
library(effects)
library(foreign)
library(ggplot2)
library(ggeffects)
library(gridExtra)
library(Hmisc)
library(knitr)
library(lme4)
library(MASS)
library(mlogit)
library(msm)
library(MuMIn)
library(nlme)
library(plyr)
library(QuantPsyc)
library(reshape2)
library(rms)
library(sandwich)
library(sfsmisc)
library(sjPlot)
library(stringr)
library(vcd)
library(visreg)
# load functions
source("https://slcladal.github.io/rscripts/blrsummary.r")
source("https://slcladal.github.io/rscripts/multiplot.r")
source("https://slcladal.github.io/rscripts/mlinrsummary.r")
source("https://slcladal.github.io/rscripts/SampleSizeMLR.r")
source("https://slcladal.github.io/rscripts/ExpR.r")

After preparing the session, we can now load the data and inspect its structure and properties.

# load data
mlrdata <- read.delim("https://slcladal.github.io/data/mlrdata.txt", header = TRUE)
# inspect data
datatable(mlrdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

The data set consist of three variables stored in three columns. The first column contains the relationship status of the women, the second whether the man is interested in the woman, and the third column represents the money spend on the present. The data set represents 100 cases and the mean amount of money spend on a present is 88.38 dollars. In a next step, we visualize the data to get a more detailed impression of the relationships between variables.

# create plots
p1 <- ggplot(mlrdata, aes(status, money)) +   # data + x/y-axes
  geom_boxplot(fill=c("grey30", "grey70")) + # def. col.
  theme_bw(base_size = 8)+   # black and white theme
  labs(x = "") +                        # x-axis label
  labs(y = "Money spent on present (AUD)", cex = .75) +   # y-axis label
  coord_cartesian(ylim = c(0, 250)) +   # y-axis range
  guides(fill = FALSE) +                # no legend
  ggtitle("Status")                     # title
# plot 2
p2 <- ggplot(mlrdata, aes(attraction, money)) +
  geom_boxplot(fill=c("grey30", "grey70")) +
  theme_bw(base_size = 8) +
  labs(x = "") +                              # x-axis label
  labs(y = "Money spent on present (AUD)") +  # y-axis label
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
  ggtitle("Attraction")
# plot 3
p3 <- ggplot(mlrdata, aes(x = money)) +
  geom_histogram(aes(y=..density..),    # add density statistic
                 binwidth = 10,         # def. bin width
                 colour = "black",      # def. bar edge colour
                 fill = "white") +      # def. bar col.
    theme_bw() +                        # black-white theme
  geom_density(alpha=.2, fill = "gray50") + # def. col. of overlay
    labs(x = "Money spent on present (AUD)") +
  labs(y = "Density of frequency")
# plot 4
p4 <- ggplot(mlrdata, aes(status, money)) +
  geom_boxplot(notch = F, aes(fill = factor(status))) + # create boxplot
  scale_fill_manual(values = c("grey30", "grey70")) +   # def. col. palette
  facet_wrap(~ attraction, nrow = 1) +  # separate panels for attraction
  theme_set(theme_bw(base_size = 8)) +
  labs(x = "") +
  labs(y = "Money spent on present (AUD)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE)
# show plots
grid.arrange(grobs = list(p1, p2, p3, p4), widths = c(1, 1), layout_matrix = rbind(c(1, 2), c(3, 4)))

The upper left figure consists of a boxplot which shows how much money was spent based on the relationship status of the moan. The figure suggests that men spend more on women who are not in a relationship. The next figure shows the relationship between the money spend on presents and whether or not the men were interested in the women.

The boxplot in the upper right panel suggests that men spend substantially more on women if the men are interested in them. The next figure depicts the distribution of the amounts of money spend on women. In addition, the figure indicates the existence of two outliers (dots in the boxplot)

The histogram in the lower left panel shows that, although the mean amount of money spent on presents is 88.38 dollars, the distribution peaks around 50 dollars indicating that on average, men spend about 50 dollars on presents. Finally, we will plot the amount of money spend on presents against relationship status by attraction in order to check whether the money spent on presents is affected by an interaction between attraction and relationship status.

The boxplot in the lower right panel confirms the existence of an interaction (a non-additive term) as men only spend more money on single women if the men are interested in the women. If men are not interested in the women, then the relationship has no effect as they spend an equal amount of money on the women regardless of whether they are in a relationship or not.

We will now start to implement the regression model. In a first step, we create two saturated base-line models that contain all possible predictors (main effects and interactions). The two models are identical but one is generated with the “lm” and the other with the “glm” function as these functions offer different model parameters in their output.

m1.mlr = lm(                      # generate lm regression object
  money ~ 1 + attraction*status,  # def. rgression formula (1 = intercept)
  data = mlrdata)                 # def. data
m1.glm = glm(                     # generate glm regression object
  money ~ 1 + attraction*status,  # def. rgression formula (1 = intercept)
  family = gaussian,              # def. linkage function
  data = mlrdata)                 # def. data

After generating the saturated base-line models we can now start with the model fitting. Model fitting refers to a process that aims at find the model that explains a maximum of variance with a minimum of predictors (see Field, Miles, and Field 2012, 318). Model fitting is therefore based on the principle of parsimony which is related to Occam’s razor according to which explanations that require fewer assumptions are more likely to be true.

Automatic Model Fitting and Why You Should Not Use It

In this section, we will use a step-wise step-down procedure that uses decreases in AIC (Akaike information criterion) as the criterion to minimize the model in a step-wise manner. This procedure aims at finding the model with the lowest AIC values by evaluating - step-by-step - whether the removal of a predictor (term) leads to a lower AIC value.

We use this method here just so that you know it exists and how to implement it but you should rather avoid using automated model fitting. The reason for avoiding automated model fitting is that the algorithsm only checks if the AIC has decreased but not if the model is stable or reliable. Thus, automated model fitting has the problem that you can never be sure that the way that lead you to the final model is reliable and that all models were indeed stable. Imagine you want to climb down from a roof top and you have a ladder. The problem is that you do not know if and how many steps are broken. This is similar to using automated model fitting. In other sections, we will explore better methods to fit models (manual step-wise step-up and step-down procedures, for example).

The AIC is calculated using the equation below. The lower the AIC value, the better the balance between explained variance and the number of predictors. AIC values can and should only be compared for models that are fit on the same dataset with the same (number of) cases (\(LL\) stands for LogLikelihood and \(k\) represents the number of predictors in the model).

\[\begin{equation} -2LL + 2k \label{eq:aic} \end{equation}\]

Interactions are evaluated first and only if all interactions have been removed would the procedure start removing main effects. Other model fitting procedures (forced entry, step-wise step up, hierarchical) are discussed during the implementation of other regression models. We cannot discuss all procedures here as model fitting is rather complex and a discussion of even the most common procedures would to lengthy and time consuming at this point. It is important to note though that there is not perfect model fitting procedure and automated approaches should be handled with care as they are likely to ignore violations of model parameters that can be detected during manual - but time consuming - model fitting procedures. As a general rule of thumb, it is advisable to fit models as carefully and deliberately as possible. We will now begin to fit the model.

# automated AIC based model fitting
step(m1.mlr, direction = "both")
## Start:  AIC=592.52
## money ~ 1 + attraction * status
## 
##                     Df Sum of Sq   RSS    AIC
## <none>                           34558 592.52
## - attraction:status  1     24947 59505 644.86
## 
## Call:
## lm(formula = money ~ 1 + attraction * status, data = mlrdata)
## 
## Coefficients:
##                          (Intercept)               attractionNotInterested                          statusSingle  
##                                99.15                                -47.66                                 57.69  
## attractionNotInterested:statusSingle  
##                               -63.18

The automated model fitting procedure informs us that removing predictors ahs not caused a decrease in the AIC. The saturated model is thus also the final minimal adequate model. We will now inspect the final minimal model and go over the model report.

m2.mlr = lm(                       # generate lm regression object
  money ~ (status + attraction)^2, # def. regression formula
  data = mlrdata)                  # def. data
m2.glm = glm(                      # generate glm regression object
  money ~ (status + attraction)^2, # def. regression formula
  family = gaussian,               # def. linkage function
  data = mlrdata)                  # def. data
# inspect final minimal model
summary(m2.mlr)
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.08 -14.26   0.46  11.93  44.14 
## 
## Coefficients:
##                                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                            99.155      3.795  26.131 < 0.0000000000000002 ***
## statusSingle                           57.693      5.366  10.751 < 0.0000000000000002 ***
## attractionNotInterested               -47.663      5.366  -8.882   0.0000000000000375 ***
## statusSingle:attractionNotInterested  -63.179      7.589  -8.325   0.0000000000005809 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.97 on 96 degrees of freedom
## Multiple R-squared:  0.852,  Adjusted R-squared:  0.8474 
## F-statistic: 184.3 on 3 and 96 DF,  p-value: < 0.00000000000000022

The first element of the report is called Call and it reports the regression formula of the model. Then, the report provides the residual distribution (the range, median and quartiles of the residuals) which allows drawing inferences about the distribution of differences between observed and expected values. If the residuals are distributed unevenly, then this is a strong indicator that the model is unstable and unreliable because mathematical assumptions on which the model is based are violated.

Next, the model summary reports the most important part: a table with model statistics of the fixed-effects structure of the model. The table contains the estimates (coefficients of the predictors), standard errors, t-values, and the p-values which show whether a predictor significantly correlates with the dependent variable that the model investigates.

All main effects (status and attraction) as well as the interaction between status and attraction is reported as being significantly correlated with the dependent variable (money). An interaction occurs if a correlation between the dependent variable and a predictor is affect by another predictor.

The top most term is called intercept and has a value of 99.15 which represents the base estimate to which all other estimates refer. To exemplify what this means, let us consider what the model would predict a man would spend on a present for a women who is single but the man is not attracted to her: The amount he would spend (based on the model would be 99.15 dollars (the intercept) plus 57.69 dollars (because she is single) minus 47.66 dollars (because he is not interested in her) minus 63.18 dollars because of the interaction between status and attraction.

#intercept  Single  NotInterested  Single:NotInterested
99.15     + 57.69  + 0           + 0     # 156.8 single + interested
## [1] 156.84
99.15     + 57.69  - 47.66       - 63.18 # 46.00 single + not interested
## [1] 46
99.15     - 0      + 0           - 0     # 99.15 relationship + interested
## [1] 99.15
99.15     - 0      - 47.66       - 0     # 51.49 relationship + not interested
## [1] 51.49

Interestingly, the model predicts that a man would invest even less money in a woman that he is not interested in if she were single compared to being in a relationship! We can derive the same results easier using the “predict” function.

# make prediction based on the model for original data
prediction <- predict(m2.mlr, newdata = mlrdata)
# inspect predictions
table(round(prediction,2))
## 
##  46.01  51.49  99.15 156.85 
##     25     25     25     25

Below the table of coefficients, the regression summary reports model statistics that provide information about how well the model performs. The difference between the values and the values in the coefficients table is that the model statistics refer to the model as a whole rather than focusing on individual predictors.

The multiple R2-value is a measure of how much variance the model explains. A multiple R2-value of 0 would inform us that the model does not explain any variance while a value of .852 mean that the model explains 85.2 percent of the variance. A value of 1 would inform us that the model explains 100 percent of the variance and that the predictions of the model match the observed values perfectly. Multiplying the multiple R2-value thus provides the percentage of explained variance. Models that have a multiple R2-value equal or higher than .05 are deemed substantially significant (see Szmrecsanyi 2006, 55). It has been claimed that models should explain a minimum of 5 percent of variance but this is problematic as itis not uncommon for models to have very low explanatory power while still performing significantly and systematically better than chance. In addition, the total amount of variance is negligible in cases where one is interested in very weak but significant effects. It is much more important for model to perform significantly better than minimal base-line models because if this is not the case, then the model does not have any predictive and therefore no explanatory power.

The adjusted R2-value considers the amount of explained variance in light of the number of predictors in the model (it is thus somewhat similar to the AIC and BIC) and informs about how well the model would perform if it were applied to the population that the sample is drawn from. Ideally, the difference between multiple and adjusted R2-value should be very small as this means that the model is not overfitted. If, however, the difference between multiple and adjusted R2-value is substantial, then this would strongly suggest that the model is instable and overfitted to the data while being inadequate for drawing inferences about the population. Differences between multiple and adjusted R2-values indicate that the data contains outliers that cause the distribution of the data on which the model is based to differ from the distributions that the model mathematically requires to provide reliable estimates. The difference between multiple and adjusted R2-value in our model is very small (85.2-84.7=.05) and should not cause concern.

Before continuing, we will calculate the confidence intervals of the coefficients.

# extract confidence intervals of the coefficients
confint(m2.mlr)
##                                          2.5 %    97.5 %
## (Intercept)                           91.62258 106.68702
## statusSingle                          47.04063  68.34497
## attractionNotInterested              -58.31497 -37.01063
## statusSingle:attractionNotInterested -78.24324 -48.11436
# create and compare baseline- and minimal adequate model
m0.mlr <- lm(money ~1, data = mlrdata)
anova(m0.mlr, m2.mlr)
## Analysis of Variance Table
## 
## Model 1: money ~ 1
## Model 2: money ~ (status + attraction)^2
##   Res.Df    RSS Df Sum of Sq      F                Pr(>F)    
## 1     99 233562                                              
## 2     96  34558  3    199005 184.28 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, we compare the final minimal adequate model to the base-line model to test whether then final model significantly outperforms the baseline model.

# compare baseline- and minimal adequate model
Anova(m0.mlr, m2.mlr, type = "III")
## Anova Table (Type III tests)
## 
## Response: money
##             Sum Sq Df F value                Pr(>F)    
## (Intercept) 781016  1  2169.6 < 0.00000000000000022 ***
## Residuals    34558 96                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison between the two model confirms that the minimal adequate model performs significantly better (makes significantly more accurate estimates of the outcome variable) compared with the baseline model.

Outlier Detection

After implementing the multiple regression, we now need to look for outliers and perform the model diagnostics by testing whether removing data points disproportionately decreases model fit. To begin with, we generate diagnostic plots.

# start plotting
par(mfrow = c(2, 2)) # display plots in 3 rows/2 columns
plot(m2.mlr); par(mfrow = c(1, 1)) # cerate plots and restore original settings
## hat values (leverages) are all = 0.04
##  and there are no factor predictors; no plot no. 5

The plots do not show severe probelms such as funnel shaped patterns or drastic deviations from the diagonal line in Normal Q-Q plot (have a look at the explanation of what to look for and how to interpret these diagnostic plots in the section on simple linear regression) but data points 52, 64, and 83 are repeatedly indicated as potential outliers.

# determine a cutoff for data points that have D-values higher than 4/(n-k-1)
cutoff <- 4/((nrow(mlrdata)-length(m2.mlr$coefficients)-2))
# start plotting
par(mfrow = c(1, 2))           # display plots in 3 rows/2 columns
qqPlot(m2.mlr, main="QQ Plot") # create qq-plot
## [1] 52 83
plot(m2.mlr, which=4, cook.levels = cutoff); par(mfrow = c(1, 1))

The graphs indicate that data points 52, 64, and 83 may be problematic. We will therefore statistically evaluate whether these data points need to be removed. In order to find out which data points require removal, we extract the influence measure statistics and add them to out data set.

# extract influence statistics
infl <- influence.measures(m2.mlr)
# add infl. statistics to data
mlrdata <- data.frame(mlrdata, infl[[1]], infl[[2]])
# annotate too influential data points
remove <- apply(infl$is.inf, 1, function(x) {
  ifelse(x == TRUE, return("remove"), return("keep")) } )
# add annotation to data
mlrdata <- data.frame(mlrdata, remove)
# number of rows before removing outliers
nrow(mlrdata)
## [1] 100
# remove outliers
mlrdata <- mlrdata[mlrdata$remove == "keep", ]
# number of rows after removing outliers
nrow(mlrdata)
## [1] 98

The difference in row in the data set before and after removing data points indicate that two data points which represented outliers have been removed.

Rerun Regression

As we have a different data set now, we need to rerun the regression analysis. As the steps are identical to the regression analysis performed above, the steps will not be described in greater detail.

# recreate regression models on new data
m0.mlr = lm(money ~ 1, data = mlrdata)
m0.glm = glm(money ~ 1, family = gaussian, data = mlrdata)
m1.mlr = lm(money ~ (status + attraction)^2, data = mlrdata)
m1.glm = glm(money ~ status * attraction, family = gaussian,
             data = mlrdata)
# automated AIC based model fitting
step(m1.mlr, direction = "both")
## Start:  AIC=570.29
## money ~ (status + attraction)^2
## 
##                     Df Sum of Sq   RSS    AIC
## <none>                           30411 570.29
## - status:attraction  1     21647 52058 620.96
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Coefficients:
##                          (Intercept)                          statusSingle               attractionNotInterested  
##                                99.15                                 55.85                                -47.66  
## statusSingle:attractionNotInterested  
##                               -59.46
# create new final models
m2.mlr = lm(money ~ (status + attraction)^2, data = mlrdata)
m2.glm = glm(money ~ status * attraction, family = gaussian,
             data = mlrdata)
# inspect final minimal model
summary(m2.mlr)
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.764 -13.505  -0.989  10.599  38.772 
## 
## Coefficients:
##                                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                            99.155      3.597  27.563 < 0.0000000000000002 ***
## statusSingle                           55.854      5.140  10.866 < 0.0000000000000002 ***
## attractionNotInterested               -47.663      5.087  -9.369  0.00000000000000404 ***
## statusSingle:attractionNotInterested  -59.461      7.269  -8.180  0.00000000000133752 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.99 on 94 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8528 
## F-statistic: 188.4 on 3 and 94 DF,  p-value: < 0.00000000000000022
# extract confidence intervals of the coefficients
confint(m2.mlr)
##                                          2.5 %    97.5 %
## (Intercept)                           92.01216 106.29744
## statusSingle                          45.64764  66.05943
## attractionNotInterested              -57.76402 -37.56158
## statusSingle:attractionNotInterested -73.89468 -45.02805
# compare baseline with final model
anova(m0.mlr, m2.mlr)
## Analysis of Variance Table
## 
## Model 1: money ~ 1
## Model 2: money ~ (status + attraction)^2
##   Res.Df    RSS Df Sum of Sq      F                Pr(>F)    
## 1     97 213227                                              
## 2     94  30411  3    182816 188.36 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# compare baseline with final model
Anova(m0.mlr, m2.mlr, type = "III")
## Anova Table (Type III tests)
## 
## Response: money
##             Sum Sq Df F value                Pr(>F)    
## (Intercept) 760953  1  2352.1 < 0.00000000000000022 ***
## Residuals    30411 94                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Additional Model Diagnostics

After rerunning the regression analysis on the updated data set, we again create diagnostic plots in order to check whether there are potentially problematic data points.

# start plotting
par(mfrow = c(2, 2)) # display plots in 2 rows/2 columns
plot(m2.mlr)         # plot fitted values

par(mfrow = c(1, 1)) # restore original settings
# determine a cutoff for data points that have
# D-values higher than 4/(n-k-1)
cutoff <- 4/((nrow(mlrdata)-length(m2.mlr$coefficients)-2))
# start plotting
par(mfrow = c(1, 2))           # display plots in 1 row/2 columns
qqPlot(m2.mlr, main="QQ Plot") # create qq-plot
## 84 88 
## 82 86
plot(m2.mlr, which=4, cook.levels = cutoff) # plot cook*s distance

par(mfrow = c(1, 1))           # restore original settings

Although the diagnostic plots indicate that additional points may be problematic, but these data points deviate substantially less from the trend than was the case with the data points that have already been removed. To make sure that retaining the data points that are deemed potentially problematic by the diagnostic plots, is acceptable, we extract diagnostic statistics and add them to the data.

# add model diagnostics to the data
mlrdata$residuals <- resid(m2.mlr)
mlrdata$standardized.residuals <- rstandard(m2.mlr)
mlrdata$studentized.residuals <- rstudent(m2.mlr)
mlrdata$cooks.distance <- cooks.distance(m2.mlr)
mlrdata$dffit <- dffits(m2.mlr)
mlrdata$leverage <- hatvalues(m2.mlr)
mlrdata$covariance.ratios <- covratio(m2.mlr)
mlrdata$fitted <- m2.mlr$fitted.values

We can now use these diagnostic statistics to create more precise diagnostic plots.

# plot 5
p5 <- ggplot(mlrdata,
             aes(studentized.residuals)) +
  theme(legend.position = "none") +
  theme_set(theme_bw(base_size = 8))+
  geom_histogram(aes(y=..density..),
                 binwidth = 1,
                 colour="black",
                 fill="white") +
  labs(x = "Studentized Residual", y = "Density") +
  stat_function(fun = dnorm,
                args = list(mean = mean(mlrdata$studentized.residuals, na.rm = TRUE),
                            sd = sd(mlrdata$studentized.residuals, na.rm = TRUE)),
                colour = "red", size = 1)
# plot 6
p6 <- ggplot(mlrdata, aes(fitted, studentized.residuals)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red")+
  theme_bw(base_size = 8)+
  labs(x = "Fitted Values",
       y = "Studentized Residual")
# plot 7
p7 <- qplot(sample = mlrdata$studentized.residuals, stat="qq") +
  theme_bw(base_size = 8) +
  labs(x = "Theoretical Values",
       y = "Observed Values")
grid.arrange(p5, p6, p7, nrow = 1)

The new diagnostic plots do not indicate outliers that require removal. With respect to such data points the following parameters should be considered:

  1. Data points with standardised residuals > 3.29 should be removed (Field, Miles, and Field 2012, 269)

  2. If more than 1 percent of data points have standardized residuals exceeding values > 2.58, then the error rate of the model is inacceptable (Field, Miles, and Field 2012, 269).

  3. If more than 5 percent of data points have standardized residuals exceeding values > 1.96, then the error rate of the model is inacceptable (Field, Miles, and Field 2012, 269)

  4. In addition, data points with Cook’s D-values > 1 should be removed (Field, Miles, and Field 2012, 269)

  5. Also, data points with leverage values \(3(k + 1)/n\) (k = Number of predictors, N = Number of cases in model) should be removed (Field, Miles, and Field 2012, 270)

  6. There should not be (any) autocorrelation among predictors. This means that independent variables cannot be correlated with itself (for instance, because data points come from the same subject). If there is autocorrelation among predictors, then a Repeated Measures Design or a (hierarchical) mixed-effects model should be implemented instead.

  7. Predictors cannot substantially correlate with each other (multicollinearity). If a model contains predictors that have variance inflation factors (VIF) > 10 the model is completely unreliable (Myers 1990) and predictors causing such VIFs should be removed. Indeed, even VIFs of 2.5 can be problematic (Szmrecsanyi 2006, 215, @zuur2010protocol) proposes that variables with VIFs exceeding 3 should be removed!

  8. Data points with 1/VIF values \(<\) .1 must be removed (data points with values above .2 are considered problematic) (Menard 1995).

  9. The mean value of VIFs should be \(<\) 1 (Bowerman and O’Connell 1990).

# 1: optimal = 0
# (aufgelistete datenpunkte sollten entfernt werden)
which(mlrdata$standardized.residuals > 3.29)
## integer(0)
# 2: optimal = 1
# (listed data points should be removed)
stdres_258 <- as.vector(sapply(mlrdata$standardized.residuals, function(x) {
ifelse(sqrt((x^2)) > 2.58, 1, 0) } ))
(sum(stdres_258) / length(stdres_258)) * 100
## [1] 0
# 3: optimal = 5
# (listed data points should be removed)
stdres_196 <- as.vector(sapply(mlrdata$standardized.residuals, function(x) {
ifelse(sqrt((x^2)) > 1.96, 1, 0) } ))
(sum(stdres_196) / length(stdres_196)) * 100
## [1] 6.122449
# 4: optimal = 0
# (listed data points should be removed)
which(mlrdata$cooks.distance > 1)
## integer(0)
# 5: optimal = 0
# (data points should be removed if cooks distance is close to 1)
which(mlrdata$leverage >= (3*mean(mlrdata$leverage)))
## integer(0)
# 6: checking autocorrelation:
# Durbin-Watson test (optimal: grosser p-wert)
dwt(m2.mlr)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01433247      1.968042   0.654
##  Alternative hypothesis: rho != 0
# 7: test multicolliniarity 1
vif(m2.mlr)
##                         statusSingle              attractionNotInterested statusSingle:attractionNotInterested 
##                                 2.00                                 1.96                                 2.96
# 8: test multicolliniarity 2
1/vif(m2.mlr)
##                         statusSingle              attractionNotInterested statusSingle:attractionNotInterested 
##                            0.5000000                            0.5102041                            0.3378378
# 9: mean vif should not exceed 1
mean(vif(m2.mlr))
## [1] 2.306667

Except for the mean VIF value (2.307) which should not exceed 1, all diagnostics are acceptable. We will now test whether the sample size is sufficient for our model. With respect to the minimal sample size and based on (Green 1991), (Field, Miles, and Field 2012, 273–74) offer the following rules of thumb (k = number of predictors; categorical predictors with more than two levels should be recoded as dummy variables):

Evaluation of Sample Size

After performing the diagnostics, we will now test whether the sample size is adequate and what the values of “R” would be based on a random distribution in order to be able to estimate how likely a \(\beta\)-error is given the present sample size (see Field, Miles, and Field 2012, 274). Beta errors (or \(\beta\)-errors) refer to the erroneous assumption that a predictor is not significant (based on the analysis and given the sample) although it does have an effect in the population. In other words, \(\beta\)-error means to overlook a significant effect because of weaknesses of the analysis. The test statistics ranges between 0 and 1 where lower values are better. If the values approximate 1, then there is serious concern as the model is not reliable given the sample size. In such cases, unfortunately, the best option is to increase the sample size.

# check if sample size is sufficient
smplesz(m2.mlr)
## [1] "Sample too small: please increase your sample by  9  data points"
# check beta-error likelihood
expR(m2.mlr)
## [1] "Based on the sample size expect a false positive correlation of 0.0309 between the predictors and the predicted"

The function “smplesz” reports that the sample size is insufficient by 9 data points according to Green (1991). The likelihood of \(\beta\)-errors, however, is very small (0.0309). As a last step, we summarize the results of the regression analysis.

tab_model(m0.glm, m2.glm)
  money money
Predictors Estimates CI p Estimates CI p
(Intercept) 88.12 78.72 – 97.52 <0.001 99.15 92.10 – 106.21 <0.001
status [Single] 55.85 45.78 – 65.93 <0.001
attraction
[NotInterested]
-47.66 -57.63 – -37.69 <0.001
status [Single] *
attraction
[NotInterested]
-59.46 -73.71 – -45.21 <0.001
Observations 98 98
R2 Nagelkerke 0.000 1.000

Note: The R2 values in this report is incorrect! As we have seen above the correct R2 values are: multiple RR2 0.8574, Adjusted R2 0.8528:

summary(m2.mlr)
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.764 -13.505  -0.989  10.599  38.772 
## 
## Coefficients:
##                                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                            99.155      3.597  27.563 < 0.0000000000000002 ***
## statusSingle                           55.854      5.140  10.866 < 0.0000000000000002 ***
## attractionNotInterested               -47.663      5.087  -9.369  0.00000000000000404 ***
## statusSingle:attractionNotInterested  -59.461      7.269  -8.180  0.00000000000133752 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.99 on 94 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8528 
## F-statistic: 188.4 on 3 and 94 DF,  p-value: < 0.00000000000000022

Although Field, Miles, and Field (2012) suggest that the main effects of the predictors involved in the interaction should not be interpreted, they are interpreted here to illustrate how the results of a multiple linear regression can be reported. Accordingly, the results of the regression analysis performed above can be summarized as follows:

A multiple linear regression was fitted to the data using an automated, step-wise, AIC-based (Akaike’s Information Criterion) procedure. The model fitting arrived at a final minimal model. During the model diagnostics, two outliers were detected and removed. Further diagnostics did not find other issues after the removal.

The final minimal adequate regression model is based on 98 data points and performs highly significantly better than a minimal baseline model (Multiple R2: .857, Adjusted R2: .853, F-statistic (3, 94): 154.4, AIC: 850.4, BIC: 863.32, p<.001\(***\)). The final minimal adequate regression model reports attraction and status as significant main effects. The relationship status of women correlates highly significantly and positively with the amount of money spend on the women’s presents (SE: 5.14, t-value: 10.87, p<.001\(***\)). This shows that men spend 156.8 dollars on presents are single while they spend 99,15 dollars if the women are in a relationship. Whether men are attracted to women also correlates highly significantly and positively with the money they spend on women (SE: 5.09, t-values: -9.37, p<.001\(***\)). If men are not interested in women, they spend 47.66 dollar less on a present for women compared with women the men are interested in.

Furthermore, the final minimal adequate regression model reports a highly significant interaction between relationship status and attraction (SE: 7.27, t-value: -8.18, p<.001\(***\)): If women are single but man are not interested in them, men spend 59.46 dollars less on their presents compared to all other constellations.

1.3 Multiple Binomial Logistic Regression

Logistic regression is a multivariate analysis technique that builds on and is very similar in terms of its implementation to linear regression but logistic regressions take dependent variables that represent nominal rather than numeric scaling (Harrell Jr 2015). The difference requires that the linear regression must be modified in certain ways to avoid producing non-sensical outcomes. The most fundamental difference between logistic and linear regressions is that logistic regression work on the probabilities of an outcome (the likelihood), rather than the outcome itself. In addition, the likelihoods on which the logistic regression works must be logged (logarithmized) in order to avoid produce predictions that produce values greater than 1 (instance occurs) and 0 (instance does not occur).

To understand what this mean, we will use a very simple example. In this example, we want to see whether the height of men affect their likelihood of being in a relationship. The data we use represents a data set consisting of two variables: height and relationship.

The left panel of the Figure above shows that a linear model would predict values for the relationship status, which represents a factor (0 = Not in Relationship and 1 = In Relationship), that are non-sensical because 1.1 does not make sense if the only options are 0 OR 1. The logistic function shown in the right panel of the Figure above solves this problem by working on the logged probabilities of an outcome rather than on the actual outcome.

Example 1: EH in Kiwi English

To exemplify hot to implement a logistic regression in R (see Agresti 1996, @agresti2011categorical for very good and thorough introductions to this topic), we will analyse the use of the discourse particle eh in New Zealand English and test which factors correlate with its occurrence. The data set represents speech units in a corpus that were coded for the speaker who uttered a given speech unit, the gender, ethnicity, and age of that speaker and whether or not the speech unit contained an eh. To begin with, we clean the current work space, set option, install and activate relevant packages, load customized functions, and load the example data set.

# load data
blrdata <- read.table("https://slcladal.github.io/data/blrdata.txt",
                      comment.char = "",  # data does not contain comments
                      quote = "",         # data does not contain quotes
                      sep = "\t",         # data is tab separetd
                      header = T)         # variables have headers
# inspect data
datatable(blrdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

The summary of the data show that the data set contains 25,821 observations of five variables. The variable “ID” contains strings that represent a combination file and speaker of a speech unit. The second variable represents the gender, the third the age, and the fourth the ethnicity of speakers. The fifth variable represents whether or not a speech unit contained the discourse particle eh.

Next, we factorize the variables in our data set. In other words, we specify that the strings represent variable levels and define new reference levels because as a default “R” will use the variable level which first occurs in alphabet ordering as the reference level for each variable, we redefine the variable levels for Age and Ethnicity.

vrs <- c("Age", "Gender", "Ethnicity", "ID")  # define variables to be factorized
fctr <- which(colnames(blrdata) %in% vrs)     # define vector with variables
blrdata[,fctr] <- lapply(blrdata[,fctr], factor) # factorize variables
blrdata$Age <- relevel(blrdata$Age, "Young") # relevel Age (Young = Reference)
blrdata$Ethnicity <- relevel(                # relevel Ethnicity
  blrdata$Ethnicity, "Pakeha") # define Pakeha as Reference level)

After preparing the data, we will now plot the data to get an overview of potential relationships between variables.

ggplot(blrdata, aes(Age, EH, color = Gender)) +
  facet_wrap(~Ethnicity) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
  labs(x = "", y = "Observed Probabilty of eh") +
  scale_color_manual(values = c("gray20", "gray70"))

With respect to main effects, the Figure above indicates that men use eh more frequently than women, that young speakers sue it more frequently compared with old speakers, and that speakers that are descendants of European settlers (Pakeha) use eh more frequently compared with Maori (the native inhabitants of New Zealand).

The plots in the lower panels do not indicate significant interactions between use of eh and the Age, Gender, and Ethnicity of speakers. In a next step, we will start building the logistic regression model.

Model Building

As a first step, we need to define contrasts and add a distance matrix to the options. Contrasts define what and how variable levels should be compared and therefore influences how the results of the regression analysis are presented.

# set contrasts
options(contrasts  =c("contr.treatment", "contr.poly"))
# create distance matrix
blrdata.dist <- datadist(blrdata)
# include distance matrix in options
options(datadist = "blrdata.dist")

Next, we generate a minimal model that predicts the use of eh solely based on the intercept.

# baseline glm model
m0.glm = glm(EH ~ 1, family = binomial, data = blrdata)

Model fitting

We will now start with the model fitting procedure. In the present case, we will use a manual step-wise step-up procedure during which predictors are added to the model if they significantly improve the model fit. In addition, we will perform diagnostics as we fit the model ateach step of the model fitting process rather than after the fitting.

We will test two things in particular: whether the data has incomplete information or complete separation and if the model suffers from multicollinearity.

Incomplete information or complete separation means that the data does not contain all combinations of the predictor or the dependent variable. This is important because if the data does not contain cases of all combinations, the model will assume that it has found a perfect predictor. In such cases the model overestimates the effect of that that predictor and the results of that model are no longer reliable. For example, if “EH” was only used by young speakers in the data, the model would jump on that fact and say “Ha! If there is an old speaker, that means that that speaker will never ever and under no circumstances say”EH" - I can therefore ignore all other factors!"

Multicollinearity means that predictors correlate and have shared variance. This means that whichever predictor is included first will take all the variance that it can explain and the remaining part of the variable that is shared will not be attributed to the other predictor. This may lead to reporting that a factor is not significant because all of the variance it can explain is already accounted for. However, if the other predictor were included first, then the original predictor would be returned as insignificant. This means that- depending on the order in which predictors are added - the results of the regression can differ dramatically and the model is therefore not reliable. Multicollinearity is actually a very common problem and there are various ways to deal with it but it cannot be ignored (at least in regression analyses).

We will start by adding “Age” to the minimal adequate model.

# check incomplete information
ifelse(min(ftable(blrdata$Age, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
# add age to the model
m1.glm = glm(EH ~ Age, family = binomial, data = blrdata)
# check multicollinearity (vifs should have values of 3 or lower for main effects)
ifelse(max(vif(m1.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
# check if adding Age significantly improves model fit
anova(m1.glm, m0.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age
## Model 2: EH ~ 1
##   Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
## 1     25819      32377                                      
## 2     25820      33008 -1  -630.89 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the data does not contain incomplete information, the vif values are below 3, and adding “Age” has significantly imporved the mdel fit (the p-value of the anova is lower than .05). We therefore proceed with “Age” included.

We continue by adding “Gender”. We add a second ANOVA test to see if including Gender affects the significance of other predictors in the model. If this were the case - if adding Gender woudl cause Age to become insignificant - then we could change the ordering in which we include predictors into our model.

ifelse(min(ftable(blrdata$Gender, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m2.glm <- update(m1.glm, . ~ . +Gender)
ifelse(max(vif(m2.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m2.glm, m1.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender
## Model 2: EH ~ Age
##   Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
## 1     25818      32140                                      
## 2     25819      32377 -1  -237.32 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m2.glm, test = "LR")
## Analysis of Deviance Table (Type II tests)
## 
## Response: EH
##        LR Chisq Df            Pr(>Chisq)    
## Age      668.64  1 < 0.00000000000000022 ***
## Gender   237.32  1 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, including “Gender” significantly improves model fit and the data does not contain incomplete information or complete separation. Also, including “Gender” does not affect the significance of “Age”. Now, we include “Ethnicity”.

ifelse(min(ftable(blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m3.glm <- update(m2.glm, . ~ . +Ethnicity)
ifelse(max(vif(m3.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m3.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25817      32139                     
## 2     25818      32140 -1 -0.26101   0.6094

Since adding “Ethnicity” does not significantly imporve the mdoel fit, we do not need to test if its inclusion affects the significance of other predictors. We continue without “Ethnicity” and include the interaction between “Age” and “Gender”.

ifelse(min(ftable(blrdata$Age, blrdata$Gender, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m4.glm <- update(m2.glm, . ~ . +Age*Gender)
ifelse(max(vif(m4.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m4.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Age:Gender
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25817      32139                     
## 2     25818      32140 -1 -0.12424   0.7245

The interaction between Age and Gender is not significant which means that men and women do not behave differently with respect to their use of “EH” as they age. Also, the data does not contain incomplete information and the model does not suffer from multicollinerity - the predictors are not collinear. We can now include if there is a significant interaction between “Age” and “Ethnicity”.

ifelse(min(ftable(blrdata$Age, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m5.glm <- update(m2.glm, . ~ . +Age*Ethnicity)
ifelse(max(vif(m5.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m5.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity + Age:Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25816      32136                     
## 2     25818      32140 -2  -3.0686   0.2156

Again, no incomplete information or multicollinearity and no significant interaction. Now, we test if there exists a significant interaction between “Gender” and “Ethnicity”.

ifelse(min(ftable(blrdata$Gender, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m6.glm <- update(m2.glm, . ~ . +Gender*Ethnicity)
ifelse(max(vif(m6.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m6.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity + Gender:Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25816      32139                     
## 2     25818      32140 -2 -0.27225   0.8727

As the interaction between “Gender” and “Ethnicity” is not significant, we continue without it. In a final step, we include the three-way interaction between “Age”, “Gender”, and “Ethnicity”.

ifelse(min(ftable(blrdata$Age, blrdata$Gender, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m7.glm <- update(m2.glm, . ~ . +Gender*Ethnicity)
ifelse(max(vif(m7.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m7.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity + Gender:Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25816      32139                     
## 2     25818      32140 -2 -0.27225   0.8727

We have found our final minimal adequate model because the 3-way interaction is also insignificant. As we have now arrived at the final minimal adequate model (m2.glm), we generate a final minimal model using the “lrm” model.

m2.lrm <- lrm(EH ~ Age+Gender, data = blrdata, x = T, y = T, linear.predictors = T)
m2.lrm
## Logistic Regression Model
##  
##  lrm(formula = EH ~ Age + Gender, data = blrdata, x = T, y = T, 
##      linear.predictors = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs         25821    LR chi2     868.21    R2       0.046    C       0.602    
##   0          17114    d.f.             2    g        0.432    Dxy     0.203    
##   1           8707    Pr(> chi2) <0.0001    gr       1.541    gamma   0.302    
##  max |deriv| 3e-10                          gp       0.091    tau-a   0.091    
##                                             Brier    0.216                     
##  
##               Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept    -0.2324 0.0223 -10.44 <0.0001 
##  Age=Old      -0.8305 0.0335 -24.78 <0.0001 
##  Gender=Women -0.4201 0.0273 -15.42 <0.0001 
## 
anova(m2.lrm)
##                 Wald Statistics          Response: EH 
## 
##  Factor     Chi-Square d.f. P     
##  Age        614.04     1    <.0001
##  Gender     237.65     1    <.0001
##  TOTAL      802.65     2    <.0001

After fitting the model, we validate the model to avoid arriving at a final minimal model that is overfitted to the data at hand.

Model Validation

To validate a model, you can apply the “validate” function and apply it to a saturated model. The output of the “validate” function shows how often predictors are retained if the sample is re-selected with the same size but with placing back drawn data points. The execution of the function requires some patience as it is rather computationally expensive and it is, tehrefore, commented out below.

# model validation (remove # to activate: output too long for website)
m7.lrm <- lrm(EH ~ (Age+Gender+Ethnicity)^3, data = blrdata, x = T, y = T, linear.predictors = T)
#validate(m7.lrm, bw = T, B = 200)

The “validate” function shows that retaining two predictors (Age and Gender) is the best option and thereby confirms our final minimal adequate model as the best minimal model. In addition, we check whether we need to include a penalty for data points because they have too strong of an impact of the model fit. To see whether a penalty is warranted, we apply the “pentrace” function to the final minimal adequate model.

pentrace(m2.lrm, seq(0, 0.8, by = 0.05)) # determine penalty
## 
## Best penalty:
## 
##  penalty       df
##      0.8 1.999254
## 
##  penalty       df      aic      bic    aic.c
##     0.00 2.000000 864.2138 847.8959 864.2133
##     0.05 1.999953 864.2139 847.8964 864.2134
##     0.10 1.999907 864.2140 847.8969 864.2135
##     0.15 1.999860 864.2141 847.8973 864.2136
##     0.20 1.999813 864.2142 847.8978 864.2137
##     0.25 1.999767 864.2143 847.8983 864.2138
##     0.30 1.999720 864.2143 847.8987 864.2139
##     0.35 1.999674 864.2144 847.8992 864.2140
##     0.40 1.999627 864.2145 847.8997 864.2140
##     0.45 1.999580 864.2146 847.9001 864.2141
##     0.50 1.999534 864.2147 847.9006 864.2142
##     0.55 1.999487 864.2148 847.9011 864.2143
##     0.60 1.999440 864.2148 847.9015 864.2144
##     0.65 1.999394 864.2149 847.9020 864.2144
##     0.70 1.999347 864.2150 847.9024 864.2145
##     0.75 1.999301 864.2151 847.9029 864.2146
##     0.80 1.999254 864.2151 847.9033 864.2147

The values are so similar that a penalty is unnecessary. In a next step, we rename the final models.

lr.glm <- m2.glm  # rename final minimal adequate glm model
lr.lrm <- m2.lrm  # rename final minimal adequate lrm model

Now, we calculate a Model Likelihood Ratio Test to check if the final model performs significantly better than the initial minimal base-line model. The result of this test is provided as a default if we call a summary of the lrm object.

modelChi <- lr.glm$null.deviance - lr.glm$deviance
chidf <- lr.glm$df.null - lr.glm$df.residual
chisq.prob <- 1 - pchisq(modelChi, chidf)
modelChi; chidf; chisq.prob
## [1] 868.2138
## [1] 2
## [1] 0

The code above provides three values: a \(\chi\)2, the degrees of freedom, and a p-value. The p-value is lower than .05 and the results of the Model Likelihood Ratio Test therefore confirm that the final minimal adequate model performs significantly better than the initial minimal base-line model. Another way to extract the model likelihood test statistics is to use an ANOVA to compare the final minimal adequate model to the minimal base-line model.

A handier way to get thses statistics is by performing an ANOVA on the final minimal model which, if used this way, is identical to a Model Likelihood Ratio test.

anova(m0.glm, lr.glm, test = "Chi") # Model Likelihood Ratio Test
## Analysis of Deviance Table
## 
## Model 1: EH ~ 1
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
## 1     25820      33008                                      
## 2     25818      32140  2   868.21 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In a next step, we calculate pseudo-R2 values which represent the amount of residual variance that is explained by the final minimal adequate model. We cannot use the ordinary R2 because the model works on the logged likelihoods rather than the values of the dependent variable.

# calculate pseudo R^2
# number of cases
ncases <- length(fitted(lr.glm))
R2.hl <- modelChi/lr.glm$null.deviance
R.cs <- 1 - exp ((lr.glm$deviance - lr.glm$null.deviance)/ncases)
R.n <- R.cs /( 1- ( exp (-(lr.glm$null.deviance/ ncases))))
# function for extracting pseudo-R^2
logisticPseudoR2s <- function(LogModel) {
  dev <- LogModel$deviance
    nullDev <- LogModel$null.deviance
    modelN <-  length(LogModel$fitted.values)
    R.l <-  1 -  dev / nullDev
    R.cs <- 1- exp ( -(nullDev - dev) / modelN)
    R.n <- R.cs / ( 1 - ( exp (-(nullDev / modelN))))
    cat("Pseudo R^2 for logistic regression\n")
    cat("Hosmer and Lemeshow R^2  ", round(R.l, 3), "\n")
    cat("Cox and Snell R^2        ", round(R.cs, 3), "\n")
    cat("Nagelkerke R^2           ", round(R.n, 3),    "\n") }
logisticPseudoR2s(lr.glm)
## Pseudo R^2 for logistic regression
## Hosmer and Lemeshow R^2   0.026 
## Cox and Snell R^2         0.033 
## Nagelkerke R^2            0.046

The low pseudo-R2 values show that our model has very low explanatory power as it only accounts for approximately 2.6 to 4.6 percent of the variance in the logged likelihoods (to get the percentages, you simply multiply the pseudo-R2 values by 100). Next, we extract the confidence intervals for the coefficients of the model.

# extract the confidence intervals for the coefficients
confint(lr.glm)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -0.2760509 -0.1887787
## AgeOld      -0.8964864 -0.7650958
## GenderWomen -0.4735310 -0.3667038

Despite having low explanatory and predictive power, the age of speakers and their gender are significant as the confidence intervals of the coefficients do not overlap with 0.

Effect Size

In a next step, we compute odds ratios and their confidence intervals. Odds Ratios represent a common measure of effect size and can be used to compare effect sizes across models. Odds ratios rang between 0 and infinity. Values of 1 indicate that there is no effect. The further away the values are from 1, the stronger the effect. If the values are lower than 1, then the variable level correlates negatively with the occurrence of the outcome (the likelihood decreases) while values above 1 indicate a positive correlation and show that the variable level causes an increase in the likelihood of the outcome (the occurrence of EH).

exp(lr.glm$coefficients) # odds ratios
## (Intercept)      AgeOld GenderWomen 
##   0.7926425   0.4358154   0.6569723
exp(confint(lr.glm))     # confidence intervals of the coefficients
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.7587743 0.8279697
## AgeOld      0.4080007 0.4652893
## GenderWomen 0.6227993 0.6930149

The odds ratios confirm that older speakers use eh significantly less often compared with younger speakers and that women use eh less frequently than men as the confidence intervals of the odds rations do not overlap with 1. In a next step, we calculate the prediction accuracy of the model.

Prediction Accuracy

In order to calculate the prediction accuracy of the model, we rearrange the data so that it does not reflect one speech unit per row but the number of speech units with EH and the number of speech units without eh per speaker! Thus, we transform the data into a per speaker rather than a per speech-unit format.

blrdata_byspeaker <- table(blrdata$ID, blrdata$EH)
blrdata_byspeaker <- data.frame(rownames(blrdata_byspeaker), blrdata_byspeaker[, 1], blrdata_byspeaker[, 2])
names(blrdata_byspeaker) <- c("ID", "NOEH", "EH")
rownames(blrdata_byspeaker) <- 1:length(blrdata_byspeaker[,1])

blrdata_byspeaker <- plyr::join(blrdata_byspeaker,  # join by-speaker data and biodata
                          blrdata, by = "ID", # join by ID
                          type = "left",      # only speakers for which bio data is provided
                          match = "first")    #
blrdata_byspeaker$EH <- NULL                  # remove EH column
# inspect data
datatable(blrdata_byspeaker, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")
# use by.spk data to fit another model which we will use to test the accuracy of the model
lr.glm.spk <- glm(cbind(EH, NOEH) ~ Age*Gender + Ethnicity + Age:Ethnicity, data = blrdata_byspeaker, family = binomial)
correct <- sum(blrdata_byspeaker$EH * (predict(lr.glm.spk, type = "response") >= 0.5)) + sum(blrdata_byspeaker$NOEH * (predict(lr.glm.spk, type="response") < 0.5))
tot <- sum(blrdata_byspeaker$EH) + sum(blrdata_byspeaker$NOEH)
predict.acc <- (correct/tot)*100
predict.acc
## [1] 99.60424

The models predicts 99.6 of cases accurately which appears to be a satisfactory result but in order to evaluate the prediction accuracy, we need to compare it to the accuracy of the minimal base-line model.

# extract prediction accuracy
lr.glm.spk.base <- glm(cbind(EH, NOEH) ~ 1, data = blrdata_byspeaker, family = binomial)
correct.b <- sum(blrdata_byspeaker$EH * (predict(lr.glm.spk.base, type = "response") >= 0.5)) + sum(blrdata_byspeaker$NOEH * (predict(lr.glm.spk.base, type="response") < 0.5))
tot.b <- sum(blrdata_byspeaker$EH) + sum(blrdata_byspeaker$NOEH)
predict.acc.base <- (correct.b/tot.b)*100
# inspect prediction accuracy
predict.acc.base
## [1] 99.60424

Both, the final-minimal and the minimal base-line model have the same prediction accuracy. This is interesting and we need to determine why this is the case. We will extract the predictions based on both models to find out why the predictions are identical.

# compare preictions of final and base line model
which(lr.glm.spk$fitted > .5)
## named integer(0)
which(lr.glm.spk.base$fitted > .5)
## named integer(0)

The reason why both models arrive at the same predictions is that because both models always predict an absence of EH.

# create variable with contains the prediction of the model
blrdata$Prediction <- predict(lr.glm, blrdata, type = "response")
blrdata$Prediction <- ifelse(blrdata$Prediction > .5, 1, 0)
# convert predicted and observed into factors with the same levels
blrdata$Prediction <- factor(blrdata$Prediction, levels = c("0", "1"))
blrdata$EH <- factor(blrdata$EH, levels = c("0", "1"))
# create a confusion matrix with compares observed against predicted values
caret::confusionMatrix(blrdata$Prediction, blrdata$EH)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17114  8707
##          1     0     0
##                                              
##                Accuracy : 0.6628             
##                  95% CI : (0.657, 0.6686)    
##     No Information Rate : 0.6628             
##     P-Value [Acc > NIR] : 0.5029             
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.6628             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.6628             
##          Detection Rate : 0.6628             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 0                  
## 

We can use the sjPlot package to visualize the effects.

efp1 <- plot_model(lr.glm, type = "pred", terms = c("Age"), axis.lim = c(0, 1)) # predicted probability
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
efp2 <- plot_model(lr.glm, type = "pred", terms = c("Gender")) # predicted percentage
grid.arrange(efp1, efp2, nrow = 1)

A more intuitive way to visualize results id to plot the predicted values against the observed values.

# extract predicted probabilities
blrdata$Predicted <- predict(lr.glm, blrdata, type = "response")
# plot
ggplot(blrdata, aes(Age, Predicted, color = Gender)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
    ylim(0, .75) +
  labs(x = "", y = "Predicted Probabilty of eh") +
  scale_color_manual(values = c("gray20", "gray70"))

Model Diagnostics

We are now in a position to perform model diagnostics and test if the model violates distributional requirements. In a first step, we test for the existence of multicollinearity.

Multicollinearity

To check whether the final minimal model contains predictors that correlate with each other, we extract variance inflation factors (VIF). If a model contains predictors that have variance inflation factors (VIF) > 10 the model is completely unreliable and cannot claim the multicollinearity is absent (Myers 1990). Predictors causing such VIFs should be removed. Indeed, predictors with VIF values greater than 4 are usually already problematic but, for large data sets, even VIFs greater than 2 can lead inflated standard errors (Jaeger 2013). Also, VIFs of 2.5 can be problematic (Szmrecsanyi 2006, 215) and (Zuur, Ieno, and Elphick 2010) proposes that variables with VIFs exceeding 3 should be removed.

vif(lr.glm)
##      AgeOld GenderWomen 
##    1.004815    1.004815

In addition, predictors with 1/VIF values \(<\) .1 must be removed (data points with values above .2 are considered problematic) (Menard 1995) and the mean value of VIFs should be \(<\) 1 (Bowerman and O’Connell 1990).

mean(vif(lr.glm))
## [1] 1.004815

Outlier detection

In order to detect potential outliers, we will calculate diagnostic parameters and add these to our data set.

infl <- influence.measures(lr.glm) # calculate influence statistics
blrdata <- data.frame(blrdata, infl[[1]], infl[[2]]) # add influence statistics

In a next step, we use these diagnostic parameters to check if there are data points which should be removed as they unduly affect the model fit.

Sample Size

We now check whether the sample size is sufficient for our analysis (Green 1991). * if you are interested in the overall model: 50 + 8k (k = number of predictors) * if you are interested in individual predictors: 104 + k * if you are interested in both: take the higher value!

# function to evaluate sample size
smplesz <- function(x) {
  ifelse((length(x$fitted) < (104 + ncol(summary(x)$coefficients)-1)) == TRUE,
    return(
      paste("Sample too small: please increase your sample by ",
      104 + ncol(summary(x)$coefficients)-1 - length(x$fitted),
      " data points", collapse = "")),
    return("Sample size sufficient")) }
# apply unction to model
smplesz(lr.glm)
## [1] "Sample size sufficient"

According to rule of thumb provided in Green (1991), the sample size is sufficient for our analysis.

Summarizing Results

As a final step, we summarize our findings in tabulated form.

tab_model(lr.glm, lr.lrm)
## Profiled confidence intervals may take longer time to compute. Use 'df_method="wald"' for faster computation of CIs.
## Profiled confidence intervals may take longer time to compute. Use 'df_method="wald"' for faster computation of CIs.
  EH EH
Predictors Odds Ratios CI p Odds Ratios CI p
(Intercept) 0.79 0.76 – 0.83 <0.001
Age [Old] 0.44 0.41 – 0.47 <0.001
Gender [Women] 0.66 0.62 – 0.69 <0.001
Intercept 0.79 0.76 – 0.83 <0.001
Age=Old 0.44 0.41 – 0.47 <0.001
Gender=Women 0.66 0.62 – 0.69 <0.001
Observations 25821 25821
R2 Tjur 0.032 0.046

A more detailed summary table can be retrieved as follows:

blrsummary <- blrsummary(lr.glm, lr.lrm, predict.acc) # summarize regression analysis
# inspect data
datatable(blrsummary, options = list(pageLength = 5, scrollX=T), filter = "none")

R2 (Hosmer & Lemeshow)

Hosmer and Lemeshow’s R2 “is the proportional reduction in the absolute value of the log-likelihood measure and as such it is a measure of how much the badness of fit improves as a result of the inclusion of the predictor variables. It can vary between 0 (indicating that the predictors are useless at predicting the outcome variable) and 1 (indicating that the model predicts the outcome variable perfectly)” (Field, Miles, and Field 2012, 317).

R2 (Cox & Snell)

"Cox and Snell’s R2 (1989) is based on the deviance of the model (-2LL(new») and the deviance of the baseline model (-2LL(baseline), and the sample size, n […]. However, this statistic never reaches its theoretical maximum of 1.

R2 (Nagelkerke)

Since R2 (Cox & Snell) never reaches its theoretical maximum of 1, Nagelkerke (1991) suggested Nagelkerke’s R2 (Field, Miles, and Field 2012, 317–18).

Somers’ Dxy

Somers’ Dxy is a rank correlation between predicted probabilities and observed responses ranges between 0 (randomness) and 1 (perfect prediction). Somers’ Dxy should have a value higher than .5 for the model to be meaningful (Baayen 2008, 204).

C C is an index of concordance between the predicted probability and the observed response. When C takes the value 0.5, the predictions are random, when it is 1, prediction is perfect. A value above 0.8 indicates that the model may have some real predictive capacity (Baayen 2008, 204).

Akaike information criteria (AIC)

Akaike information criteria (AlC = -2LL + 2k) provide a value that reflects a ratio between the number of predictors in the model and the variance that is explained by these predictors. Changes in AIC can serve as a measure of whether the inclusion of a variable leads to a significant increase in the amount of variance that is explained by the model. “You can think of this as the price you pay for something: you get a better value of R2, but you pay a higher price, and was that higher price worth it? These information criteria help you to decide. The BIC is the same as the AIC but adjusts the penalty included in the AlC (i.e., 2k) by the number of cases: BlC = -2LL + 2k x log(n) in which n is the number of cases in the model” (Field, Miles, and Field 2012, 318).

1.4 Ordinal Regression

Ordinal regression is very similar to multiple linear regression but takes an ordinal dependent variable (Agresti 2010). For this reason, ordinal regression is one of the key methods in analysing Likert data.

To see how an ordinal regression is implemented in R, we load and inspect the “ordinaldata” data set. The data set consists of 400 observations of students that were either educated at this school (Internal = 1) or not (Internal = 0). Some of the students have been abroad (Exchange = 1) while other have not (Exchange = 0). In addition, the data set contains the students’ final score of a language test (FinalScore) and the dependent variable which the recommendation of a committee for an additional, very prestigious program. The recommendation has three levels (“very likely”, “somewhat likely”, and “unlikely”) and reflects the committees’ assessment of whether the student is likely to succeed in the program.

# load data
ordata <- read.delim("https://slcladal.github.io/data/ordinaldata.txt", sep = "\t", header = T)
colnames(ordata) <- c("Recommend", "Internal", "Exchange", "FinalScore")
# inspect data
str(ordata)
## 'data.frame':    400 obs. of  4 variables:
##  $ Recommend : chr  "very likely" "somewhat likely" "unlikely" "somewhat likely" ...
##  $ Internal  : int  0 1 1 0 0 0 0 0 0 1 ...
##  $ Exchange  : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ FinalScore: num  3.26 3.21 3.94 2.81 2.53 ...

In a first step, we need to relevel the ordinal variable to represent an ordinal factor (or a progression from “unlikely” over “somewhat likely” to “very likely”. And we will also factorize Internal and Exchange to make it easier to interpret the output later on.

# relevel data
ordata <- ordata %>%
dplyr::mutate(Recommend = factor(Recommend, 
                           levels=c("unlikely", "somewhat likely", "very likely"),
                           labels=c("unlikely",  "somewhat likely",  "very likely"))) %>%
  dplyr::mutate(Exchange = ifelse(Exchange == 1, "Exchange", "NoExchange")) %>%
  dplyr::mutate(Internal = ifelse(Internal == 1, "Internal", "External"))

Now that the dependent variable is releveled, we check the distribution of the variable levels by tabulating the data. To get a better understanding of the data we create frequency tables across variables rather than viewing the variables in isolation.

## three way cross tabs (xtabs) and flatten the table
ftable(xtabs(~ Exchange + Recommend + Internal, data = ordata))
##                            Internal External Internal
## Exchange   Recommend                                 
## Exchange   unlikely                       25        6
##            somewhat likely                12        4
##            very likely                     7        3
## NoExchange unlikely                      175       14
##            somewhat likely                98       26
##            very likely                    20       10

We also check the mean and standard deviation of the final score as final score is a numeric variable and cannot be tabulated (unless we convert it to a factor).

summary(ordata$FinalScore); sd(ordata$FinalScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   2.720   2.990   2.999   3.270   4.000
## [1] 0.3979409

The lowest score is 1.9 and the highest score is a 4.0 with a mean of approximately 3. Finally, we inspect the distributions graphically.

# visualize data
ggplot(ordata, aes(x = Recommend, y = FinalScore)) +
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) +
  facet_grid(Exchange ~ Internal, margins = TRUE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

We see that we have only few students that have taken part in an exchange program and there are also only few internal students overall. With respect to recommendations, only few students are considered to very likely succeed in the program. We can now start with the modeling by using the polr function. To make things easier for us, we will only consider the main effects here as this tutorial only aims to how to implement an ordinal regression but not how it should be done in a proper study - then, the model fitting and diagnostic procedures would have to be performed accurately, of course.

## fit ordered logit model and store results 'm'
m <- polr(Recommend ~ Internal + Exchange + FinalScore, data = ordata, Hess=TRUE)
## view a summary of the model
summary(m)
## Call:
## polr(formula = Recommend ~ Internal + Exchange + FinalScore, 
##     data = ordata, Hess = TRUE)
## 
## Coefficients:
##                      Value Std. Error t value
## InternalInternal   1.04766     0.2658   3.942
## ExchangeNoExchange 0.05868     0.2979   0.197
## FinalScore         0.61574     0.2606   2.363
## 
## Intercepts:
##                             Value  Std. Error t value
## unlikely|somewhat likely    2.2620 0.8822     2.5641 
## somewhat likely|very likely 4.3574 0.9045     4.8177 
## 
## Residual Deviance: 717.0249 
## AIC: 727.0249

The results show that having studied here at this school increases the chances of receiving a positive recommendation but that having been on an exchange has a negative but insignificant effect on the recommendation. The final score also correlates positively with a positive recommendation but not as much as having studied here.

## store table
(ctable <- coef(summary(m)))
##                                  Value Std. Error   t value
## InternalInternal            1.04766394  0.2657891 3.9417109
## ExchangeNoExchange          0.05868108  0.2978588 0.1970097
## FinalScore                  0.61574360  0.2606313 2.3625085
## unlikely|somewhat likely    2.26199764  0.8821736 2.5641185
## somewhat likely|very likely 4.35744190  0.9044678 4.8176858

As the regression report does not provide p-values, we have to calculate them separately (after having calculated them, we add them to the coefficient table).

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, "p value" = p))
##                                  Value Std. Error   t value        p value
## InternalInternal            1.04766394  0.2657891 3.9417109 0.000080902439
## ExchangeNoExchange          0.05868108  0.2978588 0.1970097 0.843819939315
## FinalScore                  0.61574360  0.2606313 2.3625085 0.018151726606
## unlikely|somewhat likely    2.26199764  0.8821736 2.5641185 0.010343823241
## somewhat likely|very likely 4.35744190  0.9044678 4.8176858 0.000001452328

As predicted, Exchange does not have a significant effect but FinalScore and Internal both correlate significantly with the likelihood of receiving a positive recommendation.

# extract profiled confidence intervals
ci <- confint(m)
## Waiting for profiling to be done...
# calculate odds ratios and combine them with profiled CIs
exp(cbind(OR = coef(m), ci))
##                          OR     2.5 %   97.5 %
## InternalInternal   2.850983 1.6958378 4.817114
## ExchangeNoExchange 1.060437 0.5950332 1.919771
## FinalScore         1.851033 1.1136253 3.098491

The odds ratios show that internal students are 2.85 or 285 times as likely as non-internal students to receive positive evaluations and that a 1-point increase in the test score lead to a 1.85 or 185 percent increase in the chances of receiving a positive recommendation. The effect of an exchange is slightly negative but, as we have seen above, not significant.

1.5 Poisson Regression

Poisson regressions are used to analyse data where the dependent variable represents counts. In particular counts that are based on observations of something that is measured in set intervals. For instances the number of pauses in two-minute-long conversations. Poisson regressions are particularly appealing when dealing with rare events, i.e. when something only occurs very infrequently. In such cases, normal linear regressions do not work because the instances that do occur are automatically considered outliers. Therefore, it is useful to check if the data conform to a Poisson distribution.

However, the tricky thing about Poisson regressions is that the data has to conform to the Poisson distribution which is, according to my experience, rarely the case, unfortunately. In contrast to the Gaussian Normal Distribution which is very flexible because it is defined by two parameters, the mean (mu, i.e. \(\mu\)) and the standard deviation (sigma, i.e. \(\sigma\)). This allows the normal distribution to take very different shapes (for example, very high and slim (compressed) or very wide and flat). In contrast, the Poisson is defined by only one parameter (lambda, i.e. \(\lambda\)) which mean that if we have a mean of 2, then the standard deviation is also 2 (actually we would have to say that the mean is \(\lambda\) and the standard deviation is also \(\lambda\) or \(\lambda\) = \(\mu\) = \(\sigma\)). This is much trickier for natural data as this means that the Poisson distribution is very rigid.

As we can see, as \(\lambda\) takes on higher values, the distribution becomes wider and flatter - a compressed distribution with a high mean can therefore not be Poisson-distributed. We will now start by loading the data.

# load data
poissondata <- read.delim("https://slcladal.github.io/data/posdata.txt", sep = "\t", header = T, skipNul = T, quote = "")
# inspect data
summary(poissondata)
##        Id             Pauses     Language          Alcohol     
##  Min.   :  1.00   Min.   :0.00   NULL:German    Min.   :33.00  
##  1st Qu.: 50.75   1st Qu.:0.00   NULL:Russian   1st Qu.:45.00  
##  Median :100.50   Median :0.00   NULL:German    Median :52.00  
##  Mean   :100.50   Mean   :0.63   NULL:German    Mean   :52.65  
##  3rd Qu.:150.25   3rd Qu.:1.00   NULL:German    3rd Qu.:59.00  
##  Max.   :200.00   Max.   :6.00   NULL:Russian   Max.   :75.00  
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:English                  
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:Russian                  
##                                  NULL:Russian                  
##                                  NULL:German                   
##                                  NULL:English                  
##                                  NULL:German                   
##                                  NULL:Russian                  
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:German                   
##                                  NULL:English                  
##  [ erreichte getOption("max.print") --  175 Zeilen ausgelassen ]

We will clean the data by factorizing Id which is currently considered a numeric variable rather than a factor.

# process data
poissondata <- poissondata %>%
  mutate(Id = factor(Id, levels = 1:200, labels = 1:200))
# inspect data
summary(poissondata); str(poissondata)
##        Id          Pauses     Language          Alcohol     
##  1      :  1   Min.   :0.00   NULL:German    Min.   :33.00  
##  2      :  1   1st Qu.:0.00   NULL:Russian   1st Qu.:45.00  
##  3      :  1   Median :0.00   NULL:German    Median :52.00  
##  4      :  1   Mean   :0.63   NULL:German    Mean   :52.65  
##  5      :  1   3rd Qu.:1.00   NULL:German    3rd Qu.:59.00  
##  6      :  1   Max.   :6.00   NULL:Russian   Max.   :75.00  
##  (Other):194                  NULL:German                   
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:English                  
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:Russian                  
##                               NULL:Russian                  
##                               NULL:German                   
##                               NULL:English                  
##                               NULL:German                   
##                               NULL:Russian                  
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:German                   
##                               NULL:English                  
##  [ erreichte getOption("max.print") --  175 Zeilen ausgelassen ]
## 'data.frame':    200 obs. of  4 variables:
##  $ Id      : Factor w/ 200 levels "1","2","3","4",..: 45 108 15 67 153 51 164 133 2 53 ...
##  $ Pauses  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Language: chr  "German" "Russian" "German" "German" ...
##  $ Alcohol : int  41 41 44 42 40 42 46 40 33 46 ...

First, we check if the conditions for a Poisson regression are met.

# output the results
gf = goodfit(poissondata$Pauses,type= "poisson",method= "ML")
summary(gf)
## 
##   Goodness-of-fit test for poisson distribution
## 
##                       X^2 df       P(> X^2)
## Likelihood Ratio 33.01229  5 0.000003742341

If the p-values is smaller than .05, then data is not Poisson distributed which means that it differs significantly from a Poisson distribution and is very likely over-dispersed. We will check the divergence from a Poisson distribution visually by plotting the observed counts against the expected counts if the data were Poisson distributed.

plot(gf,main="Count data vs Poisson distribution")

Although the goodfit function reported that the data differs significantly from the Poisson distribution, the fit is rather good. We can use an additional Levene’s test to check if variance homogeneity is given.

# check homogeneity
leveneTest(poissondata$Pauses, poissondata$Language, center = mean)
## Warning in leveneTest.default(poissondata$Pauses, poissondata$Language, : poissondata$Language coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value       Pr(>F)    
## group   2  17.153 0.0000001357 ***
##       197                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene’s test indicates that variance homogeneity is also violated. Since both the approximation to a Poisson distribution and variance homogeneity are violated, we should switch either to a quasi-Poisson model or a negative binomial model. However, as we are only interested in how to implement a Poisson model here, we continue despite the fact that this could not be recommended if we were actually interested in accurate results based on a reliable model.

In a next step, we summarize Progression by inspecting the means and standard deviations of the individual variable levels.

# extract mean and standard devaiation
with(poissondata, tapply(Pauses, Language, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                English                 German                Russian 
## "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)" "M (SD) = 0.20 (0.40)"

Now, we visualize the data.

# plot data
ggplot(poissondata, aes(Pauses, fill = Language)) +
  geom_histogram(binwidth=.5, position="dodge") +
  scale_fill_manual(values=c("gray30", "gray50", "gray70"))

# calculate poissant regression
m1.poisson <- glm(Pauses ~ Language + Alcohol, family="poisson", data=poissondata)
# inspect model
summary(m1.poisson)
## 
## Call:
## glm(formula = Pauses ~ Language + Alcohol, family = "poisson", 
##     data = poissondata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                 Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)     -4.16327    0.66288  -6.281 0.0000000003373 ***
## LanguageGerman  -0.71405    0.32001  -2.231         0.02566 *  
## LanguageRussian -1.08386    0.35825  -3.025         0.00248 ** 
## Alcohol          0.07015    0.01060   6.619 0.0000000000363 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

In addition to the Estimates for the coefficients, we could also calculate the confidence intervals for the coefficients (LL stands for lower limit and UL for upper limit in the table below).

cov.m1 <- vcovHC(m1.poisson, type="HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate= coef(m1.poisson), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(m1.poisson)/std.err), lower.tail=FALSE),
LL = coef(m1.poisson) - 1.96 * std.err,
UL = coef(m1.poisson) + 1.96 * std.err)
# inspect data
r.est
##                   Estimate  Robust SE            Pr(>|z|)          LL          UL
## (Intercept)     -4.1632653 0.64809429 0.00000000013286377 -5.43353007 -2.89300044
## LanguageGerman  -0.7140499 0.29864225 0.01680312040111837 -1.29938873 -0.12871111
## LanguageRussian -1.0838591 0.32104816 0.00073547448241672 -1.71311353 -0.45460476
## Alcohol          0.0701524 0.01043516 0.00000000001783975  0.04969947  0.09060532
with(m1.poisson, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df         p
## [1,]     189.4496 196 0.6182274
## update m1 model dropping prog
m2.poisson <- update(m1.poisson, . ~ . - Language)
## test model differences with chi square test
anova(m2.poisson, m1.poisson, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Pauses ~ Alcohol
## Model 2: Pauses ~ Language + Alcohol
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       198     204.02                          
## 2       196     189.45  2   14.572 0.0006852 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
s <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4)), 
                                                coef(m1.poisson), cov.m1)

## exponentiate old estimates dropping the p values
rexp.est <- exp(r.est[, -3])
## replace SEs with estimates for exponentiated coefficients
rexp.est[, "Robust SE"] <- s

rexp.est
##                   Estimate  Robust SE         LL         UL
## (Intercept)     0.01555668 0.01008219 0.00436765 0.05540971
## LanguageGerman  0.48965711 0.14623230 0.27269844 0.87922793
## LanguageRussian 0.33828750 0.10860658 0.18030354 0.63469878
## Alcohol         1.07267164 0.01119351 1.05095521 1.09483681
# extract predicted values
(s1 <- data.frame(Alcohol = mean(poissondata$Alcohol),
  Language = factor(1:3, levels = 1:3, labels = names(table(poissondata$Language)))))
##   Alcohol Language
## 1  52.645  English
## 2  52.645   German
## 3  52.645  Russian
predict(m1.poisson, s1, type="response", se.fit=TRUE)
## $fit
##         1         2         3 
## 0.6249446 0.3060086 0.2114109 
## 
## $se.fit
##          1          2          3 
## 0.08628117 0.08833706 0.07050108 
## 
## $residual.scale
## [1] 1
## calculate and store predicted values
poissondata$Predicted <- predict(m1.poisson, type="response")

## order by program and then by math
poissondata <- poissondata[with(poissondata, order(Language, Alcohol)), ]
## create the plot
ggplot(poissondata, aes(x = Alcohol, y = Predicted, colour = Language)) +
  geom_point(aes(y = Pauses), alpha=.5, 
             position=position_jitter(h=.2)) +
  geom_line(size = 1) +
  labs(x = "Alcohol (ml)", y = "Expected number of pauses") +
  scale_color_manual(values=c("gray30", "gray50", "gray70"))

1.6 Robust Regression

Robust regressions are linear regressions with added weights (Rousseeuw and Leroy 2005) and are thus used when a linear model is unduly affected by outliers but the data points should not be removed.

# load data
robustdata <- read.delim("https://slcladal.github.io/data/mlrdata.txt", sep = "\t", header = T)
# inspect data
summary(robustdata)
##   status             attraction               money       
##  NULL:Relationship   NULL:NotInterested   Min.   :  0.93  
##  NULL:Relationship   NULL:NotInterested   1st Qu.: 49.84  
##  NULL:Relationship   NULL:NotInterested   Median : 81.73  
##  NULL:Relationship   NULL:NotInterested   Mean   : 88.38  
##  NULL:Relationship   NULL:NotInterested   3rd Qu.:121.57  
##  NULL:Relationship   NULL:NotInterested   Max.   :200.99  
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:NotInterested                   
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  NULL:Relationship   NULL:Interested                      
##  [ erreichte getOption("max.print") --  67 Zeilen ausgelassen ]

We now fit an ordinary linear model (and although we know from the section on multiple regression, that the interaction between status and attraction is significant, we will disregard this for now as this will help to explain the weighing procedure which is the focus of this section).

# create model
slm <- lm(money ~ status+attraction, data = robustdata)
# inspect model
summary(slm)
## 
## Call:
## lm(formula = money ~ status + attraction, data = robustdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.87 -15.79  -2.61  13.89  59.94 
## 
## Coefficients:
##                         Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              114.949      4.290   26.80 < 0.0000000000000002 ***
## statusSingle              26.103      4.954    5.27          0.000000826 ***
## attractionNotInterested  -79.252      4.954  -16.00 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.77 on 97 degrees of freedom
## Multiple R-squared:  0.7452, Adjusted R-squared:   0.74 
## F-statistic: 141.9 on 2 and 97 DF,  p-value: < 0.00000000000000022

We now check whether the model is well fitted using diagnostic plots.

# create model diagnost plots
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(slm, las = 1)
## hat values (leverages) are all = 0.03
##  and there are no factor predictors; no plot no. 5

par(opar)

The diagnostic plots indicate that there are three outliers in the data (data points 52, 83 and possibly 64). Therefore, we need to evaluate if the outliers severely affect the fit of the model.

robustdata[c(52, 64, 83), 1:2]
##    status    attraction
## 52 Single NotInterested
## 64 Single NotInterested
## 83 Single    Interested

We can now calculate Cook’s distance and standardized residuals check if the values of the potentially problematic points have unaccepatbly high values (-2 < ok < 2).

CooksDistance <- cooks.distance(slm)
StandardizedResiduals <- stdres(slm)
a <- cbind(robustdata, CooksDistance, StandardizedResiduals)
a[CooksDistance > 4/100, ]
##          status    attraction  money CooksDistance StandardizedResiduals
## 1  Relationship NotInterested  86.33    0.04441588              2.075654
## 52       Single NotInterested   0.93    0.06419375             -2.495354
## 65       Single NotInterested  12.12    0.04276136             -2.036628
## 67       Single NotInterested  13.28    0.04078780             -1.989074
## 83       Single    Interested 200.99    0.06223971              2.457082
## 84       Single    Interested 193.69    0.04800208              2.157823
## 88       Single    Interested 193.78    0.04816637              2.161513

We will calculate the absolute value of oder the table so that it is easier to check the values.

AbsoluteStandardizedResiduals <- abs(StandardizedResiduals)
a <- cbind(robustdata, CooksDistance, StandardizedResiduals, AbsoluteStandardizedResiduals)
asorted <- a[order(-AbsoluteStandardizedResiduals), ]
asorted[1:10, ]
##          status    attraction  money CooksDistance StandardizedResiduals AbsoluteStandardizedResiduals
## 52       Single NotInterested   0.93    0.06419375             -2.495354                      2.495354
## 83       Single    Interested 200.99    0.06223971              2.457082                      2.457082
## 88       Single    Interested 193.78    0.04816637              2.161513                      2.161513
## 84       Single    Interested 193.69    0.04800208              2.157823                      2.157823
## 1  Relationship NotInterested  86.33    0.04441588              2.075654                      2.075654
## 65       Single NotInterested  12.12    0.04276136             -2.036628                      2.036628
## 67       Single NotInterested  13.28    0.04078780             -1.989074                      1.989074
## 78       Single    Interested 188.76    0.03943140              1.955721                      1.955721
## 21 Relationship NotInterested  81.90    0.03698374              1.894049                      1.894049
## 24 Relationship NotInterested  81.56    0.03644143              1.880111                      1.880111

As Cook’s distance and the standardized residuals do have unaccepatble values, we re-calculate the linear model as a robust regression and inspect the results

# create robust regression model
rmodel <- rlm(money ~ status + attraction, data = robustdata)
# inspect model
summary(rmodel)
## 
## Call: rlm(formula = money ~ status + attraction, data = robustdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.329 -15.180  -1.305  14.434  62.386 
## 
## Coefficients:
##                         Value    Std. Error t value 
## (Intercept)             112.9273   4.2760    26.4093
## statusSingle             25.6769   4.9375     5.2003
## attractionNotInterested -76.3451   4.9375   -15.4622
## 
## Residual standard error: 22.54 on 97 degrees of freedom

We will briefly check the weights to understand the process of weighing better. The idea of weighing is to downgrade data points that are too influential while not punishing data points that have a good fit and are thus less influential. This means that the problematic data points should have lower weights than other data points (the maximum is 1 - so points can only be made “lighter”).

hweights <- data.frame(status = robustdata$status, resid = rmodel$resid, weight = rmodel$w)
hweights2 <- hweights[order(rmodel$w), ]
hweights2[1:15, ]
##          status     resid    weight
## 83       Single  62.38576 0.4860068
## 52       Single -61.32916 0.4943507
## 88       Single  55.17576 0.5495187
## 84       Single  55.08576 0.5504166
## 78       Single  50.15576 0.6045227
## 65       Single -50.13916 0.6046785
## 1  Relationship  49.74775 0.6094242
## 67       Single -48.97916 0.6189993
## 21 Relationship  45.31775 0.6689962
## 24 Relationship  44.97775 0.6740532
## 39 Relationship -43.63733 0.6947534
## 79       Single  40.78576 0.7434152
## 58       Single -40.70916 0.7447469
## 89       Single  39.93576 0.7592395
## 95       Single  39.78576 0.7621022

The values of the weights support our assumption that those data points that were deemed too influential are made “lighter” as they now only have weights of .486 and .494 respectively.

After inspecting the weights, we also want to extract the p-values for the predictors. The p-values have to be calculated separately using the “f.robftest” fucntion from the “sfsmisc” library.

p_status <- f.robftest(rmodel, var = 2)
p_attraction <- f.robftest(rmodel, var = 3)
# inspect results
p_status; p_attraction
## 
##  robust F-test (as if non-random weights)
## 
## data:  from rlm(formula = money ~ status + attraction, data = robustdata)
## F = 27.015, p-value = 0.000001119
## alternative hypothesis: true statusSingle is not equal to 0
## 
##  robust F-test (as if non-random weights)
## 
## data:  from rlm(formula = money ~ status + attraction, data = robustdata)
## F = 239.01, p-value < 0.00000000000000022
## alternative hypothesis: true attractionNotInterested is not equal to 0

The output shows that both status and attraction are significant but, as we have seen above, the effect that really matters is the interaction between status and attraction. This was, however, not the focus of this sections as this section meerly served to introduce the concept of weights and how they can be used in the context of a robust linear regression.

2 Mixed-Effects Regression

In contrast to fixed-effects regression models, mixed-effects models are not simple additive models because they are based on complex matrix multiplications where predicted values represent the product of the random effects multiplied by the intercept values plus the estimates of the fixed effects component in the model.

Mixed-effects models are rapidly increasing in use in data analysis because they allow us to incorporate hierarchical or nested data structures. Mixed-effects models are, of course, an extension of fixed-effects regression models and also multivariate and come in different types.

In the following, we will go over the most relevant and frequently used types of mixed-effect regression models, mixed-effects linear regression models and mixed-effects binomial logistic regression models.

The major difference between these types of models is that they take different types of dependent variables. While linear models take numeric dependent variables, logistic models take nominal variables.

2.1 Linear Mixed-Effects Regression

The following focuses on an extension of ordinary multiple linear regressions: mixed-effects regression linear regression. Mixed-effects models have the following advantages over simpler statistical tests:

  • Mixed-effects models are multivariate, i.e. they test the effect of several predictors simultaneously while controlling for the effect of all other predictors.

  • Mixed models allow to statistically incorporate within-speaker variability and are thus fit to model hierarchical or nested data structures. This applies if several observations are produced by an individual speaker, for instance.

  • Mixed-models provide a wealth of diagnostic statistics which enables us to control e.g. (multi-)collinearity, i.e. correlations between predictors, and to test whether conditions or requirements are violated (e.g. homogeneity of variance, etc.).

Major disadvantages of mixed-effects regression modeling are that they are prone to producing high \(\beta\)-errors (see Johnson 2009) and that they require rather large data sets.

Introduction

So far, the regression models that we have used only had fixed-effects. Having only fixed-effects means that all data points are treated as if they are completely independent and thus on the same hierarchical level. However, it is very common, that the data is nested in the sense that data points are not independent because they are, for instance produced by the same speaker or are grouped by some other characteristic. In such cases, the data is considered hierarchical and statistical models should incorporate such structural features of the data they work upon. With respect to regression modeling, hierarchical structures are incorporated by what is called random effects. When models only have a fixed-effects structure, then they make use of only a single intercept and/or slope (as in the left panel in the figure below), while mixed effects models have intercepts for each level of a random effect. If the random effect structure represents speakers then this would mean that a mixed-model would have a separate intercept and/or slope for each speaker.

Random Effects

Random Effects can have two parameters: the intercept (the point where the regression line crosses the y-axis) and the slope (the acclivity of the regression line). In contrast to fixed-effects models, that have only 1 intercept and one slope (left panel of the Figure above), mixed-effects models can therefore have various random intercepts (centre left panel ) or various random slopes (centre right panel ), or both, various random intercepts and various random slopes (right panel in the Figure).

What features do distinguish random and fixed effects?

  1. Random effects represent a higher level variable under which data points are grouped.

  2. Random effects represent a sample of an infinite number of possible levels. For instance, speakers represent a potentially infinite pool of elements from which a many different samples can be drawn. Thus, random effects represent a random sample sample. Fixed effects, on the other hand, typically do not represent a random sample but a fixed set of variable levels (e.g. Age groups, or parts-of-speech).

  3. Random effects typically represent many different levels while fixed effects typically have only a few. Zuur, Hilbe, and Ieno (2013) propose that a variable may be used as a fixed effect if it has less than 5 levels while it should be treated as a random effect if it has more than 10 levels. Variables with 5 to 10 levels can be used as both. However, this is a rule of thumb and ignores the theoretical reasons (random sample and nestedness) for considering something as a random effect and it also is at odds with the way that repeated measures are models (namely as mixed effects) alhough they typically only have very few levels.

In the following, we will only focus on models with random intercepts because this is the by far more common method and because including both random intercepts and random slopes requires huge amounts of data. Consider the Figure below to understand what is meant by “random intercepts”.

The left panel merely shows the data while the centre panel includes the regression line for a regression that estimates Weight based on Height. The right panel shows the regression line and, in addition, random intercepts each of the three groups.

After adding random intercepts, predictors (or fixed effects) are added to the model (just like with multiple regression). So mixed-effects are called mixed-effects because they contain both random and fixed effects.

In terms of general procedure, random effects are added first, and only after we have ascertained that including random effects is warranted, we test whether including fixed-effects is warranted (Field, Miles, and Field 2012). We test whether including random effects is warranted by comparing a model, that bases its estimates of the depended variable solely on the base intercept (the mean), with a model, that bases its estimates of the dependent variable solely on the intercepts of the random effect. If the random-effect model explains significantly more variance than the simple model without random effect structure, then we continue with the mixed-effects model. In other words, including random effects is justified if they reduce residual deviance.

Example: Preposition Use across Time by Genre

To explore how to implement a mixed-effects model in R we revisit the preposition data that contains relative frequencies of prepositions in English texts written between 1150 and 1913. As a first step, and to prepare our analysis, we load necessary R packages, specify options, and load as well as provide an overview of the data.

# suppress scientific notation
options("scipen" = 100, "digits" = 4)      
# do not convert strings into factors
options(stringsAsFactors = F)              
# read in data
lmmdata <- read.delim("https://slcladal.github.io/data/lmmdata.txt", header = TRUE) %>%
# convert date into a numeric variable
    dplyr::mutate(Date = as.numeric(Date))
# inspect data
datatable(lmmdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

The data set contains the date when the text was written (Date), the genre of the text (Genre), the name of the text (Text), the relative frequency of prepositions in the text (Prepositions), and the region in which the text was written (Region). We now plot the data to get a first impression of its structure.

p1 <- ggplot(lmmdata, aes(x = Date, y = Prepositions)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, color = "red", linetype = "dashed") +
  theme_bw() +
  labs(y = "Frequency\n(Prepositions)")
p2 <- ggplot(lmmdata, aes(x = reorder(Genre, -Prepositions), y = Prepositions)) +
  geom_boxplot() +
  theme_bw() + 
  theme(axis.text.x = element_text(angle=90)) +
  labs(x = "Genre", y = "Frequency\n(Prepositions)")
p3 <- ggplot(lmmdata, aes(Prepositions)) +
  geom_histogram() +
  theme_bw() + 
  labs(y = "Count", x = "Frequency (Prepositions)")
grid.arrange(grobs = list(p1, p2, p3), widths = c(1, 1), layout_matrix = rbind(c(1, 1), c(2, 3)))

The scatter plot in the upper panel indicates that the use of prepositions has moderately increased over time while the boxplots in the lower left panel show that the genres differ quite substantially with respect to their median frequencies of prepositions per text. Finally, the histogram in the lower right panel show that preposition use is distributed normally with a mean of 132.2 prepositions per text.

p4 <- ggplot(lmmdata, aes(Date, Prepositions)) +
  geom_point() +
  labs(x = "Year", y = "Prepositions per 1,000 words") +
  geom_smooth(method = "lm")  + 
  theme_bw()
p5 <- ggplot(lmmdata, aes(Region, Prepositions)) +
  geom_boxplot() +
  labs(x = "Region", y = "Prepositions per 1,000 words") +
  geom_smooth(method = "lm")  + 
  theme_bw()
grid.arrange(p4, p5, nrow = 1)

ggplot(lmmdata, aes(Date, Prepositions)) +
  geom_point() +
  facet_wrap(~ Genre, nrow = 4) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "Date of composition", y = "Prepositions per 1,000 words") +
  coord_cartesian(ylim = c(0, 220))

Centering or scaling numeric variables is useful for later interpretation of regression models: if the date variable was not centered, the regression would show the effects of variables at year 0(!). If numeric variables are scaled, other variables are variables are considered relative not to 0 but to the mean of that variable (in this case the mean of years in our data). Centering simply means that the mean of the numeric variable is subtracted from each value.

lmmdata$DateUnscaled <- lmmdata$Date
lmmdata$Date <- scale(lmmdata$Date, scale = F)
# inspect data
datatable(lmmdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

We now set up a fixed-effects model with the “glm” function and a mixed-effects model using the “glmer” function with Genre as a random effect.

# generate models
m0.glm <- glm(Prepositions ~ 1, family = gaussian, data = lmmdata)
m0.glmer = glmer(Prepositions ~ 1 + (1|Genre), data = lmmdata, family = "gaussian")

Now that we have created the base-line models, we will test whether including a random effect structure is mathematically justified. It is important to note here that we are not going to test if including a random effect structure is theoretically motivated but simply if it causes a decrease in variance.

Testing Random Effects

As a first step in the modeling process, we now need to determine whether or not including a random effect structure is justified. We do so by comparing the base-line model without random intercepts to the model with random intercepts using a Likelihood Ratio Test. A short word of warning is in order here regarding the specific of the model: we need to set “REML = T” because Relative Estimate Maximum Likelihood (REML) provides better estimates for the random effects part of the model compared with the simpler Maximum Likelihood (ML) specification (Field, Miles, and Field 2012, 879).

x2 = -2*logLik(m0.glm, REML = T)+2*logLik(m0.glmer, REML = T)
## Warning in logLik.glm(m0.glm, REML = T): extra arguments discarded
x2 <- x2 <- x2[[1]]
list(x2, pchisq(x2, df=2, lower.tail=F))
## [[1]]
## [1] 222.4
## 
## [[2]]
## [1] 0.0000000000000000000000000000000000000000000000005049

The inclusion of a random effect structure with random intercepts is justified based on the Likelihood Ratio Test.

As a note on model comparisons: when we compare mixed-effects models, the REML specification must be “FALSE” or set to “method =”ML" (Maximum Likelihood) (Field, Miles, and Field 2012, 879). This is because “ML” produces more accurate estimates of fixed regression parameters. In contrast, one should use “REML” when comparing fixed to mixed models as REML produces more accurate estimates of random variances (Twisk 2006).

Model Fitting

After having determined that including a random effect structure is justified, we can continue by fitting the model and including diagnostics as we go. Including diagnostics in the model fitting process can save time and prevent relying on models which only turn out to be unstable if we would perform the diagnostics after the fact.

We begin fitting our model by adding Date as a fixed effect and compare this model to our mixed-effects base-line model to see if Date improved the model fit by explaining variance and if Date significantly correlates with our dependent variable (this means that the difference between the models is the effect (size) of “Date”!)

m1.glmer <- glmer(Prepositions ~ (1|Genre) + Date, data = lmmdata)
## Warning in glmer(Prepositions ~ (1 | Genre) + Date, data = lmmdata): calling glmer() with family=gaussian (identity link) as a shortcut to
## lmer() is deprecated; please call lmer() directly
anova(m1.glmer, m0.glmer, test = "Chi")
## refitting model(s) with ML (instead of REML)
## Data: lmmdata
## Models:
## m0.glmer: Prepositions ~ 1 + (1 | Genre)
## m1.glmer: Prepositions ~ (1 | Genre) + Date
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)   
## m0.glmer    3 4502 4515  -2248     4496                       
## m1.glmer    4 4495 4512  -2244     4487  8.93  1     0.0028 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with Date is the better model (significant p-value and lower AIC). The significant p-value shows that “Date” correlates significantly with “Prepositions” (\(\chi\)2(1) = 8.93, p = .0028). The \(\chi\)2 value here is labelled “L.Ratio” and the degrees of freedom are calculated by subtracting the smaller number of DFs from the larger number of DFs.

We now test whether Region should also be part of the final minimal adequate model. The easiest way to add predictors is by using the “predict” function (it saves time and typing).

# generate model
m2.glmer <- update(m1.glmer, .~.+ Region)
# test vifs
car::vif(m2.glmer)
##   Date Region 
##  1.203  1.203
# compare models                
anova(m2.glmer, m1.glmer, test = "Chi")
## refitting model(s) with ML (instead of REML)
## Data: lmmdata
## Models:
## m1.glmer: Prepositions ~ (1 | Genre) + Date
## m2.glmer: Prepositions ~ (1 | Genre) + Date + Region
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m1.glmer    4 4495 4512  -2244     4487                    
## m2.glmer    5 4495 4516  -2242     4485  2.39  1       0.12

Three things tell us that Region should not be included: (i) the AIC does not decrease, (ii) the BIC increases(!), and the p-value is higher than .05. This means, that we will continue fitting the model without having Region included. Well… not quite - just as a note on including variables: while Region is not significant as a main effect, it must still be included in a model if it were part of a significant interaction. To test if this is indeed the case, we fit another model with the interaction between Date and Region as predictor.

# generate model
m3.glmer <- update(m1.glmer, .~.+Region*Date)
# extract vifs
car::vif(m3.glmer)
##        Date      Region Date:Region 
##       1.969       1.203       1.780
# compare models                
anova(m3.glmer, m1.glmer, test = "Chi")
## refitting model(s) with ML (instead of REML)
## Data: lmmdata
## Models:
## m1.glmer: Prepositions ~ (1 | Genre) + Date
## m3.glmer: Prepositions ~ (1 | Genre) + Date + Region + Date:Region
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m1.glmer    4 4495 4512  -2244     4487                    
## m3.glmer    6 4496 4522  -2242     4484  2.89  2       0.24

Again, the high p-value and the increase in AIC and BIC show that we have found our minimal adequate model with only contains Date as a main effect. In a next step, we can inspect the final minimal adequate model, i.e. the most parsimonious (the model that explains a maximum of variance with a minimum of predictors).

# inspect results
summary(m1.glmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Prepositions ~ (1 | Genre) + Date
##    Data: lmmdata
## 
## REML criterion at convergence: 4491
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.735 -0.657  0.006  0.661  3.597 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Genre    (Intercept) 159      12.6    
##  Residual             229      15.1    
## Number of obs: 537, groups:  Genre, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 133.88516    3.24749    41.2
## Date          0.01894    0.00632     3.0
## 
## Correlation of Fixed Effects:
##      (Intr)
## Date 0.005

Model Diagnostics

We can now evaluate the goodness of fit of the model and check if mathematical requirements and assumptions have been violated. In a first step, we generate diagnostic plots that focus on the random effect structure.

plot(m1.glmer, Genre ~ resid(.), abline = 0 ) # generate diagnostic plots

The plot shows that there are some outliers (points outside the boxes) and that the variability within letters is greater than in other genres we therefore examine the genres in isolation standardized residuals versus fitted values (Pinheiro and Bates 2000, 175).

plot(m1.glmer, resid(., type = "pearson") ~ fitted(.) | Genre, id = 0.05, 
     adj = -0.3, pch = 20, col = "gray40")

The plot shows the standardized residuals (or Pearson’s residuals) versus fitted values and suggests that there are outliers in the data (the names elements in the plots). To check if these outliers are a cause for concern, we will now use a Levene’s test to check if the variance is distributed homogeneously (homoscedasticity) or whether the assumption of variance homogeneity is violated (due to the outliers).

# check homogeneity
leveneTest(lmmdata$Prepositions, lmmdata$Genre, center = mean)
## Warning in leveneTest.default(lmmdata$Prepositions, lmmdata$Genre, center = mean): lmmdata$Genre coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value Pr(>F)  
## group  15    1.74   0.04 *
##       521                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene’s test shows that the variance is distributed unevenly across genres which means that we do not simply continue but should either remove problematic data points (outliers) or use a weighing method.

In this case, we create a new model which uses weights to compensate for heterogeneity of variance and thus the influence of outliers - which is an alternative to removing the data points and rerunning the analysis (Pinheiro and Bates 2000, 177). However, to do so, we need to use a different function (the lmer function) which means that we have to create two models: the “old” minimal adequate model and the “new” minimal adequate model with added weights. After we have created these models, we will compare them to see if including weights has improved the fit.

m4.lme = lme(Prepositions ~ Date, random = ~1|Genre, data = lmmdata, method = "ML")
m5.lme <- update(m4.lme, weights = varIdent(form = ~ 1 | Genre))
# compare models
anova(m5.lme, m4.lme)
##        Model df  AIC  BIC logLik   Test L.Ratio p-value
## m5.lme     1 19 4486 4567  -2224                       
## m4.lme     2  4 4495 4512  -2244 1 vs 2   39.17  0.0006

The weight model (m5.lme) that uses weights to account for unequal variance is performing significantly better than the model without weights (m4.lme) and we therefore switch to the weight model and inspect its parameters.

# inspect results
summary(m5.lme)        
## Linear mixed-effects model fit by maximum likelihood
##  Data: lmmdata 
##    AIC  BIC logLik
##   4486 4567  -2224
## 
## Random effects:
##  Formula: ~1 | Genre
##         (Intercept) Residual
## StdDev:       12.26    14.34
## 
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | Genre 
##  Parameter estimates:
##           Bible       Biography           Diary       Education         Fiction        Handbook         History             Law 
##          1.0000          0.3407          0.8695          0.7888          0.9117          1.0965          0.9787          0.7849 
##      Philosophy   PrivateLetter    PublicLetter        Religion         Science          Sermon          Travel TrialProceeding 
##          0.7370          1.1906          1.2189          0.9746          0.8486          0.9708          1.0862          1.2602 
## Fixed effects: Prepositions ~ Date 
##              Value Std.Error  DF t-value p-value
## (Intercept) 133.96    3.1444 520   42.60  0.0000
## Date          0.02    0.0055 520    3.99  0.0001
##  Correlation: 
##      (Intr)
## Date 0.004 
## 
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max 
## -3.31912 -0.67972  0.01469  0.69872  3.10388 
## 
## Number of Observations: 537
## Number of Groups: 16

We can also use an ANOVA display which is more to the point.

anova(m5.lme)          
##             numDF denDF F-value p-value
## (Intercept)     1   520  1813.9  <.0001
## Date            1   520    15.9  0.0001

As we did before, we now check, whether the final minimal model (with weights) outperforms an intercept-only base-line model.

# creat base-line model
m0.lme = lme(Prepositions ~ 1, random = ~1|Genre, data = lmmdata, method = "ML", weights = varIdent(form = ~ 1 | Genre))
anova(m5.lme, m0.lme)  # test if date is significant
##        Model df  AIC  BIC logLik   Test L.Ratio p-value
## m5.lme     1 19 4486 4567  -2224                       
## m0.lme     2 18 4496 4573  -2230 1 vs 2   12.44  0.0004

Our final minimal adequate model with weights performs significantly better than an intercept only base-line model. Before doing the final diagnostics, we well inspect the estimates for the random effect structure to check if there are values which require further inspection (e.g. because they are drastically different from all other values).

# extract estimates and sd for fixed and random effects
intervals(m5.lme)      
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                 lower      est.     upper
## (Intercept) 127.79833 133.96402 140.12970
## Date          0.01105   0.02174   0.03244
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: Genre 
##                 lower  est. upper
## sd((Intercept)) 8.525 12.26 17.64
## 
##  Variance function:
##                  lower   est.  upper
## Biography       0.2139 0.3407 0.5427
## Diary           0.6317 0.8695 1.1968
## Education       0.5792 0.7888 1.0743
## Fiction         0.6529 0.9117 1.2730
## Handbook        0.7838 1.0965 1.5340
## History         0.7325 0.9787 1.3076
## Law             0.5557 0.7849 1.1088
## Philosophy      0.4650 0.7370 1.1679
## PrivateLetter   0.9512 1.1906 1.4902
## PublicLetter    0.9472 1.2189 1.5686
## Religion        0.6694 0.9746 1.4190
## Science         0.5533 0.8486 1.3013
## Sermon          0.7296 0.9708 1.2918
## Travel          0.7927 1.0862 1.4883
## TrialProceeding 0.8689 1.2602 1.8275
## attr(,"label")
## [1] "Variance function:"
## 
##  Within-group standard error:
## lower  est. upper 
## 11.85 14.34 17.36

The random effect estimates do not show any outliers or drastically increased or decreased values which means that the random effect structure is fine.

Effect Sizes

We will now extract effect sizes (in the example: the effect size of date) and calculate normalized effect size measures (this effect size measure works for all fixed effects). To calculate the effect size, take the square root of the squared t-value divided by the t-value squared plus the degrees of freedom:

r = sqrt(t^2^/(t^2^+df)).

A brief word of warning is in order here: only apply this function to main effects that are not involved in interactions as they are meaningless because the amount of variance explained by main effects involved in interactions is unclear (Field, Miles, and Field 2012, 641).

# calculate effect size 
ggeffect(m5.lme)
## $Date
## 
## # Predicted values of Prepositions
## # x = Date
## 
##    x | Predicted |   SE |           95% CI
## ------------------------------------------
## -500 |    123.09 | 4.15 | [114.95, 131.24]
## -400 |    125.27 | 3.81 | [117.78, 132.76]
## -300 |    127.44 | 3.53 | [120.50, 134.38]
## -200 |    129.62 | 3.32 | [123.10, 136.13]
## -100 |    131.79 | 3.18 | [125.54, 138.04]
##    0 |    133.96 | 3.14 | [127.80, 140.13]
##  100 |    136.14 | 3.19 | [129.88, 142.40]
##  300 |    140.49 | 3.54 | [133.53, 147.45]
## 
## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "m5.lme"

While we have already shown that the effect of Date is significant, it is small which means that the number of prepositions per text does not correlate very strongly with time. This suggests that other factors that are not included in the model also impact the frequency of prepositions (and probably more meaningfully, too).

Before turning to the diagnostics, we will use the fitted (or predicted) and the observed values with a regression line for the predicted values. This will not only show how good the model fit the data but also the direction and magnitude of the effect.

# extract predicted values
lmmdata$Predicted <- predict(m5.lme, lmmdata)
# plot predicted values
ggplot(lmmdata, aes(DateUnscaled, Predicted)) +
  facet_wrap(~Genre) +
  geom_point(aes(x = DateUnscaled, y = Prepositions), color = "gray80", size = .5) +
  geom_smooth(aes(y = Predicted), color = "gray20", linetype = "solid", 
              se = T, method = "lm") +
  guides(color=guide_legend(override.aes=list(fill=NA))) +  
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position="top", legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) + 
  xlab("Date of composition")
## `geom_smooth()` using formula 'y ~ x'

Model Diagnostics

We now create diagnostic plots. What we wish to see in the diagnostic plots is a cloud of dots in the middle of the window without any structure. What we do not want to see is a funnel-shaped cloud because this indicates an increase of the errors/residuals with an increase of the predictor(s) (because this would indicate heteroscedasticity) (Pinheiro and Bates 2000, 182).

# start plotting
par(mfrow = c(2, 2))           # display plots in 2 rows and 2 columns
plot(m5.lme, pch = 20, col = "black", lty = "dotted")

par(mfrow = c(1, 1))

What a wonderful unstructured cloud - the lack of structure tells us that the model is “healthy” and does not suffer from heteroscedasticity. We will now create more diagnostic plots to find potential problems (Pinheiro and Bates 2000, 21).

# fitted values by Genre
plot(m5.lme, form = resid(., type = "p") ~ fitted(.) | Genre, abline = 0, 
     cex = .5, pch = 20, col = "black")

In contrast to the unweight model, no data points are named which indicates that the outliers do no longer have unwarranted influence on the model. Now, we check the residuals of fitted values against observed values (Pinheiro and Bates 2000, 179). What we would like to see is a straight, upwards going line.

# residuals of fitted values against observed
qqnorm(m5.lme, pch = 20, col = "black")

A beautiful, straight line! The qqplot does not indicate any problems. It is, unfortunately, rather common that the dots deviate from the straight line at the very bottom or the very tp which means that the model is good at estimating values around the middle of the dependent variable but rather bad at estimating lower or higher values. Next, we check the residuals by “Genre” (Pinheiro and Bates 2000, 179).

# residuals by genre
qqnorm(m5.lme, ~resid(.) | Genre, pch = 20, col = "black" )

Beautiful straight lines - perfection! Now, we inspect the observed responses versus the within-group fitted values (Pinheiro and Bates 2000, 178).

# observed responses versus the within-group fitted values
plot(m5.lme, Prepositions ~ fitted(.), id = 0.05, adj = -0.3, 
     xlim = c(80, 220), cex = .8, pch = 20, col = "blue")

Although some data points are named, the plot does not show any structure, like a funnel, which would have been problematic.

Reporting Results

Before we do the write-up, we have a look at the model summary as this will provide us with at least some of the parameters that we want to report.

summary(m5.lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: lmmdata 
##    AIC  BIC logLik
##   4486 4567  -2224
## 
## Random effects:
##  Formula: ~1 | Genre
##         (Intercept) Residual
## StdDev:       12.26    14.34
## 
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | Genre 
##  Parameter estimates:
##           Bible       Biography           Diary       Education         Fiction        Handbook         History             Law 
##          1.0000          0.3407          0.8695          0.7888          0.9117          1.0965          0.9787          0.7849 
##      Philosophy   PrivateLetter    PublicLetter        Religion         Science          Sermon          Travel TrialProceeding 
##          0.7370          1.1906          1.2189          0.9746          0.8486          0.9708          1.0862          1.2602 
## Fixed effects: Prepositions ~ Date 
##              Value Std.Error  DF t-value p-value
## (Intercept) 133.96    3.1444 520   42.60  0.0000
## Date          0.02    0.0055 520    3.99  0.0001
##  Correlation: 
##      (Intr)
## Date 0.004 
## 
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max 
## -3.31912 -0.67972  0.01469  0.69872  3.10388 
## 
## Number of Observations: 537
## Number of Groups: 16
tab_model(m5.lme)
  Prepositions
Predictors Estimates CI p
(Intercept) 133.96 127.80 – 140.13 <0.001
Date 0.02 0.01 – 0.03 <0.001
Random Effects
σ2 205.71
τ00 Genre 150.39
N Genre 16
Observations 537
Marginal R2 / Conditional R2 0.030 / NA

Note: The R2 values of the summary table are incorrect (as indicated by the missing conditional R2 value). The correct R2 values can be extracted using the r.squaredGLMM function from the MuMIn package. The respective call for the model is:

library(MuMIn)
r.squaredGLMM(m5.lme)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
##         R2m    R2c
## [1,] 0.0174 0.4324

A mixed-effect linear regression model which contained the genre of texts as random effect was fit to the data in a step-wise-step up procedure. Due to the presence of outliers in the data, weights were included into the model which led to a significantly improved model fit compared to an unweight model (\(\chi\)2(2): 39.17, p: 0.0006). The final minimal adequate model performed significantly better than an intercept-only base-line model (\(\chi\)2(1): 52.87, p <.0001) and showed that the frequency of prepositions increases significantly but only marginally with the date of composition (\(\chi\)2(2): 10.12, p: 0.0063, Pearson’s \(r\) = 0.172). Neither the region where the text was composed nor a higher order interaction between genre and region significantly correlated with the use of prepositions in the data.

Remarks on Prediction

While the number of intercepts, the model-reports, and the way how mixed- and fixed-effects arrive at predictions differ, their predictions are identical (at least when dealing with a simple random effect structure). Consider the following example where we create analogous fixed and mixed effect models and plot their predicted frequencies of prepositions per genre across the unscaled date of composition. The predictions of the mixed-effects model are plotted as a solid red line, wehile the predictions of the fixed-effects model are plotted as dashed blue lines.

# creat lm model
m5.lmeunweight <- lm(Prepositions ~ DateUnscaled + Genre, data = lmmdata)
lmmdata$lmePredictions <- fitted(m5.lmeunweight, lmmdata)
m5.lm <- lm(Prepositions ~ DateUnscaled + Genre, data = lmmdata)
lmmdata$lmPredictions <- fitted(m5.lm, lmmdata)
# plot predictions
ggplot(lmmdata, aes(x = DateUnscaled, y = lmePredictions, group = Genre)) +
  geom_line(aes(y = lmmdata$lmePredictions), linetype = "solid", color = "red") +
  geom_line(aes(y = lmmdata$lmPredictions), linetype = "dashed", color = "blue") +
  facet_wrap(~ Genre, nrow = 4) +
  theme_bw() +
  labs(x = "Date of composition") +
  labs(y = "Prepositions per 1,000 words") +
  coord_cartesian(ylim = c(0, 220))

The predictions overlap perfectly which means that the predictions of both are identical - irrespective of whether genre is part of the mixed or the fixed effects structure.

2.2 Mixed-Effects Binomial Logistic Regression

We now turn to an extension of binomial logistic regression: mixed-effects binomial logistic regression. As is the case with linear mixed-effects models logistic mixed effects models have the following advantages over simpler statistical tests:

  • Mixed-effects models are multivariate, i.e. they test the effect of several predictors simultaneously while controlling for the effect of all other predictors.

  • Mixed models allow to statistically incorporate within-speaker variability and are thus fit to model hierarchical or nested data structures. This applies if several observations are produced by an individual speaker, for instance.

  • Mixed-models provide a wealth of diagnostic statistics which enables us to control e.g. multicollinearity, i.e. correlations between predictors, and to test whether conditions or requirements are violated (e.g. homogeneity of variance, etc.).

Major disadvantages of regression modeling are that they are prone to producing high \(\beta\)-errors (see Johnson 2009) and that they require rather large data sets.

Introduction

As is the case with linear mixed-effects models, binomial logistic mixed-effect models are multivariate analysis that treat data points as hierarchical or grouped in some way. In other words, they take into account that the data is nested in the sense that data points are produced by the same speaker or are grouped by some other characteristics. In mixed-models, hierarchical structures are modelled as random effects. If the random effect structure represents speakers then this means that a mixed-model would have a separate intercept and/or slope for each speaker.

Random Effects in linear models two parameters: the intercept (the point where the regression line crosses the y-axis) and the slope (the acclivity of the regression line). In contrast to linear mixed-effects models, random effects differ in the position and the slope of the logistic function that is applied to the likelihood of the dependent variable. random intercepts (centre left panel ) or various random slopes (centre right panel ), or both, various random intercepts and various random slopes (right panel ). In the following, we will only focus on models with random intercepts because this is the by far more common method and because including both random intercepts and random slopes requires huge amounts of data. Consider the Figure below to understand what is meant by “random intercepts”.

The upper left panel merely shows the logistic curve representing the predictions of a fixed-effects logistic regression with a single intercept and slope. The upper right panel shows the logistic curves representing the predictions of a of a mixed-effects logistic regression with random intercepts for each level of a grouping variable. The lower left panel shows the logistic curves representing the predictions of a mixed-effects logistic regression with one intercept but random slopes for each level of a grouping variable. The lower right panel shows the logistic curves representing the predictions of a mixed-effects logistic regression with random intercepts and random slopes for each level of a grouping variable.

After adding random intercepts, predictors (or fixed effects) are added to the model (just like with multiple regression). So mixed-effects are called mixed-effects because they contain both random and fixed effects.

In terms of general procedure, random effects are added first, and only after we have ascertained that including random effects is warranted, we test whether including fixed-effects is warranted (Field, Miles, and Field 2012). We test whether including random effects is warranted by comparing a model, that bases its estimates of the dependent variable solely on the base intercept, with a model that bases its estimates of the dependent variable solely on the intercepts of the random effect. If the mixed-effects model explains significantly more variance than the fixed-effects model without random effect structure, then we continue with the mixed-effects model. In other words, including random effects is justified if they reduce residual deviance.

Example: Discourse LIKE in Irish English

In this example we will investigate which factors correlate with the use of final discourse like (e.g. “The weather is shite, like!”) in Irish English. The data set represents speech units in a corpus that were coded for the speaker who uttered a given speech unit, the gender (Gender: Men versus Women) and age of that speaker (Age: Old versus Young), whether the interlocutors were of the same or a different gender (ConversationType: SameGender versus MixedGender), and whether another final discourse like had been used up to three speech units before (Priming: NoPrime versus Prime), whether or not the speech unit contained an final discourse like (SUFLike: 1 = yes, 0 = no. To begin with, we load the data and inspect the structure of the data set,

# load data
mblrdata <- read.table("https://slcladal.github.io/data/mblrdata.txt", 
                       comment.char = "",# data does not contain comments
                       quote = "",       # data does not contain quotes
                       sep = "\t",       # data is tab separated
                       header = T)       # data has column names
# inspect data
datatable(mblrdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

As all variables except for the dependent variable (SUFlike) are character strings, we factorize the independent variables.

# def. variables to be factorized
vrs <- c("ID", "Age", "Gender", "ConversationType", "Priming")
# def. vector with variables
fctr <- which(colnames(mblrdata) %in% vrs)     
# factorize variables
mblrdata[,fctr] <- lapply(mblrdata[,fctr], factor)
# relevel Age (Young = Reference)
mblrdata$Age <- relevel(mblrdata$Age, "Young")
# order data by ID
mblrdata <- mblrdata %>%
  dplyr::arrange(ID)
# inspect data
datatable(mblrdata, rownames = FALSE, options = list(pageLength = 5, scrollX=T), filter = "none")

Before continuing, a few words about the minimum number of random effect levels and the minimum number of observations per random effect level are in order.

While many data points per random variable level increases statistical power and thus to more robust estimates of the random effects (Austin and Leckie 2018), it has been shown that small numbers of observations per random effect variable level do not cause serious bias and it does not negatively affect the estimates of the fixed-effects coefficients (Bell, Ferron, and Kromrey 2008; Clarke 2008; Clarke and Wheaton 2007; Maas and Hox 2005). The minimum number of observations per random effect variable level is therefore 1.

In simulation study, (Bell, Ferron, and Kromrey 2008) tested the impact of random variable levels with only a single observation ranging from 0 to 70 percent. As long as there was a relatively high number of random effect variable levels (500 or more), small numbers of observations had almost no impact on bias and Type 1 error control.

We now plot the data to inspect the relationships within the data set.

ggplot(mblrdata, aes(Gender, SUFlike, color = Priming)) +
  facet_wrap(Age~ConversationType) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
  labs(x = "", y = "Observed Probabilty of discourse like") +
  scale_color_manual(values = c("gray20", "gray70"))

The upper left panel in the Figure above indicates that men sue discourse like more frequently than women. The centre right panel suggests that priming significantly increases the likelihood of discourse like being used. The centre left panel suggests that speakers use discourse like more frequently in mixed-gender conversations. However, the lower right panel indicates an interaction between gender and conversation type as women appear to use discourse like less frequently in same gender conversations while the conversation type does not seem to have an effect on men. After visualizing the data, we will now turn to the model building process.

Model Building

In a first step, we set the options and generate a distance matrix of the data.

# set options
options(contrasts  =c("contr.treatment", "contr.poly"))
mblrdata.dist <- datadist(mblrdata)
options(datadist = "mblrdata.dist")

In a next step, we generate fixed-effects minimal base-line models and a base-line mixed-model using the “glmer” function with a random intercept for ID (a lmer object of the final minimal adequate model will be created later).

# baseline model glm
m0.glm = glm(SUFlike ~ 1, family = binomial, data = mblrdata) 
# base-line mixed-model
m0.glmer = glmer(SUFlike ~ (1|ID), data = mblrdata, family = binomial) 

Testing the Random Effect

Now, we check if including the random effect is permitted by comparing the AICs from the glm to AIC from the glmer model. If the AIC of the glmer object is smaller than the AIC of the glm object, then this indicates that including random intercepts is justified.

aic.glmer <- AIC(logLik(m0.glmer))
aic.glm <- AIC(logLik(m0.glm))
aic.glmer; aic.glm
## [1] 1828
## [1] 1838

The AIC of the glmer object is smaller which shows that including the random intercepts is justified. To confirm whether the AIC reduction is sufficient for justifying the inclusion of a random-effect structure, we also test whether the mixed-effects minimal base-line model explains significantly more variance by applying a Model Likelihood Ratio Test to the fixed- and the mixed effects minimal base-line models.

# test random effects
null.id = -2 * logLik(m0.glm) + 2 * logLik(m0.glmer)
pchisq(as.numeric(null.id), df=1, lower.tail=F) 
## [1] 0.0006314
# sig m0.glmer better than m0.glm

The p-value of the Model Likelihood Ratio Test is lower than .05 which shows that the inclusion of the random-effects structure is warranted. We can now continue with the model fitting process.

Model Fitting

The next step is to fit the model which means that we aim to find the “best” model, i.e. the minimal adequate model. In this case, we will use a manual step-wise step-up, forward elimination procedure. Before we begin with the model fitting process we need to add ´control = glmerControl(optimizer = “bobyqa”)´ to avoid unnecessary failures to converge.

m0.glmer <- glmer(SUFlike ~ 1+ (1|ID), family = binomial, data = mblrdata, control=glmerControl(optimizer="bobyqa"))

During each step of the fitting procedure, we test whether certain assumptions on which the model relies are violated. To avoid incomplete information (a combination of variables does not occur in the data), we tabulate the variables we intend to include and make sure that all possible combinations are present in the data. Including variables although not all combinations are present in the data would lead to unreliable models that report (vastly) inaccurate results. A special case of incomplete information is complete separation which occurs if one predictor perfectly explains an outcome (in that case the incomplete information would be caused by a level of the dependent variable). In addition, we make sure that the VIFs do not exceed a maximum of 3 for main effects (Zuur, Ieno, and Elphick 2010) - Booth GD (1994) suggest that VIFs should ideally be lower than 1.5 - and 20 for interactions as higher values would indicate multicollinearity and thus that the model is unstable. The value of 20 should be taken with a pinch of salt because there is no clear consensus about what the maximum VIF for interactions should be or if it should be considered at all. The reason is that we would, of course, expect the VIFs to increase when we are dealing with interactions as the main effects that are part of the interaction are very likely to correlate with the interaction itself. However, if the VIFs are too high, then this will still cause the issues with the attribution of variance. The value of 20 was chosen as it is twice the most generous value for acceptable VIFs for main effects in the standard literature on multicollinearity (Zuur et al. 2009, @neter1990vif). Only once we have confirmed that the incomplete information, complete separation, and multicollinearity are not a major concern, we generate the more saturated model and test whether the inclusion of a predictor leads to a significant reduction in residual deviance. If the predictor explains a significant amount of variance, it is retained in the model while being disregarded in case it does not explain a sufficient quantity of variance.

# add Priming
ifelse(min(ftable(mblrdata$Priming, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m1.glmer <- update(m0.glmer, .~.+Priming)
anova(m1.glmer, m0.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m0.glmer: SUFlike ~ 1 + (1 | ID)
## m1.glmer: SUFlike ~ (1 | ID) + Priming
##          npar  AIC  BIC logLik deviance Chisq Df          Pr(>Chisq)    
## m0.glmer    2 1828 1840   -912     1824                                 
## m1.glmer    3 1703 1720   -848     1697   128  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the tests do not show problems relating to incomplete information, because including Priming significantly improves the model fit (decrease in AIC and BIC values), and since it correlates significantly with our dependent variable, we include Priming into our model.

# add Age
ifelse(min(ftable(mblrdata$Age, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m2.glmer <- update(m1.glmer, .~.+ Age)
ifelse(max(car::vif(m2.glmer)) <= 3,  "VIFs okay", "VIFs unacceptable") 
## [1] "VIFs okay"
anova(m2.glmer, m1.glmer, test = "Chi")   
## Data: mblrdata
## Models:
## m1.glmer: SUFlike ~ (1 | ID) + Priming
## m2.glmer: SUFlike ~ (1 | ID) + Priming + Age
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m1.glmer    3 1703 1720   -848     1697                    
## m2.glmer    4 1704 1727   -848     1696  0.56  1       0.45
Anova(m2.glmer, test = "Chi")
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: SUFlike
##          Chisq Df          Pr(>Chisq)    
## Priming 129.51  1 <0.0000000000000002 ***
## Age       0.57  1                0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVAs show that Age is not significant and the first ANOVA also shows that the BIC has increased which indicates that Age does not decrease variance. In such cases, the variable should not be included.

However, if the second ANOVA would report Age as being marginally significant, a case could be made for including it but it would be better to change the ordering in which predictors are added to the model. This is, however, just a theoretical issue here as Age is clearly not significant.

# add Gender
ifelse(min(ftable(mblrdata$Gender, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m3.glmer <- update(m1.glmer, .~.+Gender)
ifelse(max(car::vif(m3.glmer)) <= 3,  "VIFs okay", "VIFs unacceptable") 
## [1] "VIFs okay"
anova(m3.glmer, m1.glmer, test = "Chi")
## Data: mblrdata
## Models:
## m1.glmer: SUFlike ~ (1 | ID) + Priming
## m3.glmer: SUFlike ~ (1 | ID) + Priming + Gender
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)    
## m1.glmer    3 1703 1720   -848     1697                        
## m3.glmer    4 1679 1702   -836     1671  25.4  1 0.00000047 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m3.glmer, test = "Chi")
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: SUFlike
##         Chisq Df           Pr(>Chisq)    
## Priming 124.4  1 < 0.0000000000000002 ***
## Gender   28.6  1           0.00000009 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gender is significant and will therefore be included as a predictor (you can also observe that including Gender has substantially decreased both AIC and BIC).

# add ConversationType
ifelse(min(ftable(mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m4.glmer <- update(m3.glmer, .~.+ConversationType)
ifelse(max(car::vif(m4.glmer)) <= 3,  "VIFs okay", "VIFs unacceptable") 
## [1] "VIFs okay"
anova(m4.glmer, m3.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m3.glmer: SUFlike ~ (1 | ID) + Priming + Gender
## m4.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)    
## m3.glmer    4 1679 1702   -836     1671                        
## m4.glmer    5 1669 1697   -829     1659  12.8  1    0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m4.glmer, test = "Chi")
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: SUFlike
##                  Chisq Df           Pr(>Chisq)    
## Priming          130.7  1 < 0.0000000000000002 ***
## Gender            13.4  1              0.00025 ***
## ConversationType  13.0  1              0.00031 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ConversationType improves model fit (AIC and BIC decrease and it is reported as being significant) and will, therefore, be included in the model.

# add Priming*Age
ifelse(min(ftable(mblrdata$Priming, mblrdata$Age, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m5.glmer <- update(m4.glmer, .~.+Priming*Age)
ifelse(max(car::vif(m5.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m5.glmer, m4.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m4.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType
## m5.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Age + 
## m5.glmer:     Priming:Age
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m4.glmer    5 1669 1697   -829     1659                    
## m5.glmer    7 1672 1711   -829     1658  0.98  2       0.61

The interaction between Priming and Age is not significant and will thus not be included.

# add Priming*Gender
ifelse(min(ftable(mblrdata$Priming, mblrdata$Gender, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m6.glmer <- update(m4.glmer, .~.+Priming*Gender)
ifelse(max(car::vif(m6.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m6.glmer, m4.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m4.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)   
## m4.glmer    5 1669 1697   -829     1659                       
## m6.glmer    6 1663 1697   -826     1651  7.42  1     0.0065 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m6.glmer, test = "Chi")
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: SUFlike
##                   Chisq Df           Pr(>Chisq)    
## Priming          131.96  1 < 0.0000000000000002 ***
## Gender            13.58  1              0.00023 ***
## ConversationType  10.71  1              0.00107 ** 
## Priming:Gender     7.45  1              0.00633 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction between Priming and Gender improved model fit (AIC and BIC reduction) and significantly correlates with the use of speech-unit final LIKE. It will therefore be included in the model.

# add Priming*ConversationType
ifelse(min(ftable(mblrdata$Priming, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m7.glmer <- update(m6.glmer, .~.+Priming*ConversationType)
ifelse(max(car::vif(m7.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m7.glmer, m6.glmer, test = "Chi")
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m7.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender + 
## m7.glmer:     Priming:ConversationType
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer    6 1663 1697   -826     1651                    
## m7.glmer    7 1663 1702   -825     1649  2.03  1       0.15

The interaction between Priming and ConversationType does not significantly correlate with the use of speech-unit final LIKE and it does not explain much variance (AIC and BIC increase). It will be not be included in the model.

# add Age*Gender
ifelse(min(ftable(mblrdata$Age, mblrdata$Gender, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m8.glmer <- update(m6.glmer, .~.+Age*Gender)
ifelse(max(car::vif(m8.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m8.glmer, m6.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m8.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Age + 
## m8.glmer:     Priming:Gender + Gender:Age
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer    6 1663 1697   -826     1651                    
## m8.glmer    8 1666 1710   -825     1650  1.53  2       0.47

The interaction between Age and Gender is not significant and will thus continue without it.

# add Age*ConversationType
ifelse(min(ftable(mblrdata$Age, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m9.glmer <- update(m6.glmer, .~.+Age*ConversationType)
ifelse(max(car::vif(m9.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m9.glmer, m6.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m9.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Age + 
## m9.glmer:     Priming:Gender + ConversationType:Age
##          npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer    6 1663 1697   -826     1651                    
## m9.glmer    8 1666 1711   -825     1650   0.9  2       0.64

The interaction between Age and ConversationType is not significant and will thus continue without it.

# add Gender*ConversationType
ifelse(min(ftable(mblrdata$Gender, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m10.glmer <- update(m6.glmer, .~.+Gender*ConversationType)
ifelse(max(car::vif(m10.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m10.glmer, m6.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m10.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender + 
## m10.glmer:     Gender:ConversationType
##           npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer     6 1663 1697   -826     1651                    
## m10.glmer    7 1665 1704   -825     1651  0.34  1       0.56

The interaction between Gender and ConversationType is insignificant and does not improve model fit (AIC and BIC reduction). It will therefore not be included in the model.

# add Priming*Age*Gender
ifelse(min(ftable(mblrdata$Priming,mblrdata$Age, mblrdata$Gender, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m11.glmer <- update(m6.glmer, .~.+Priming*Age*Gender)
ifelse(max(car::vif(m11.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m11.glmer, m6.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m11.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Age + 
## m11.glmer:     Priming:Gender + Priming:Age + Gender:Age + Priming:Gender:Age
##           npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer     6 1663 1697   -826     1651                    
## m11.glmer   10 1668 1724   -824     1648  3.26  4       0.51

The interaction between Priming, Age and Gender is not significant and we will thus continue without it.

# add Priming*Age*ConversationType
ifelse(min(ftable(mblrdata$Priming,mblrdata$Age, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m12.glmer <- update(m6.glmer, .~.+Priming*Age*ConversationType)
ifelse(max(car::vif(m12.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "VIFs okay"
anova(m12.glmer, m6.glmer, test = "Chi")
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m12.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Age + 
## m12.glmer:     Priming:Gender + Priming:Age + Priming:ConversationType + 
## m12.glmer:     ConversationType:Age + Priming:ConversationType:Age
##           npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)
## m6.glmer     6 1663 1697   -826     1651                    
## m12.glmer   11 1666 1728   -822     1644  6.92  5       0.23

The interaction between Priming, Age and ConversationType is not significant and we will thus continue without it.

# add Priming*Gender*ConversationType
ifelse(min(ftable(mblrdata$Priming,mblrdata$Gender, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m13.glmer <- update(m6.glmer, .~.+Priming*Gender*ConversationType)
ifelse(max(car::vif(m13.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "WARNING: high VIFs!"
anova(m13.glmer, m6.glmer, test = "Chi")
## Data: mblrdata
## Models:
## m6.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender
## m13.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender + 
## m13.glmer:     Priming:ConversationType + Gender:ConversationType + Priming:Gender:ConversationType
##           npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)  
## m6.glmer     6 1663 1697   -826     1651                      
## m13.glmer    9 1660 1710   -821     1642  9.66  3      0.022 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m13.glmer, test = "Chi")
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: SUFlike
##                                  Chisq Df           Pr(>Chisq)    
## Priming                         129.35  1 < 0.0000000000000002 ***
## Gender                           11.83  1              0.00058 ***
## ConversationType                  8.53  1              0.00349 ** 
## Priming:Gender                    3.49  1              0.06162 .  
## Priming:ConversationType          1.81  1              0.17878    
## Gender:ConversationType           0.15  1              0.69391    
## Priming:Gender:ConversationType   6.27  1              0.01230 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although the VIFs are higher than 20, the maximum value is 25.868 and thus not excessive. And since the three-way interaction between Priming, Gender, and ConversationType is not only significantly correlated with the dependent variable but also decreases AIC, we will include it in the model. However, it should be noted that there is a BIC increase(!) - it could be argued that this argues against including this three-way interaction. This is a judgement-call and depends on whether significance, AIC, or BIC is the main criterion for including predictors. Since we chose a p-value based approach here and have so far focused more on the AIC, we decide to disregard the BIC increase (it would be good practice to state that a BIC increase had taken place in a footnote though).

# add Age*Gender*ConversationType
ifelse(min(ftable(mblrdata$Age,mblrdata$Gender, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "okay"
m14.glmer <- update(m13.glmer, .~.+Age*Gender*ConversationType)
ifelse(max(car::vif(m14.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "WARNING: high VIFs!"
car::vif(m14.glmer)
##                         Priming                          Gender                ConversationType                             Age 
##                           7.712                           2.649                          31.165                           3.796 
##                  Priming:Gender        Priming:ConversationType         Gender:ConversationType                      Gender:Age 
##                          10.286                          23.859                          33.433                           5.814 
##            ConversationType:Age Priming:Gender:ConversationType     Gender:ConversationType:Age 
##                          19.786                          24.950                          21.846

In this case, the VIFs are excessive and have values higher than 30 which shows an unacceptable degree of multicollinearity so that we abort and move on the next model.

# add Priming*Age*Gender*ConversationType
ifelse(min(ftable(mblrdata$Priming,mblrdata$Age,mblrdata$Gender, mblrdata$ConversationType, mblrdata$SUFlike)) == 0, "incomplete information", "okay")
## [1] "incomplete information"
m15.glmer <- update(m13.glmer, .~.+Priming*Age*Gender*ConversationType)
ifelse(max(car::vif(m15.glmer)) <= 10,  "VIFs okay", "WARNING: high VIFs!") 
## [1] "WARNING: high VIFs!"
car::vif(m15.glmer)
##                             Priming                              Gender                    ConversationType 
##                              15.742                               2.786                              28.061 
##                                 Age                      Priming:Gender            Priming:ConversationType 
##                               3.900                              18.715                              30.022 
##             Gender:ConversationType                         Priming:Age                          Gender:Age 
##                              30.344                               9.501                               6.695 
##                ConversationType:Age     Priming:Gender:ConversationType                  Priming:Gender:Age 
##                        31637533.686                              31.550                              11.586 
##        Priming:ConversationType:Age         Gender:ConversationType:Age Priming:Gender:ConversationType:Age 
##                         9086362.739                        30682029.288                         7981201.727

Again, the VIFs are excessive! As this was the last possible model, we have found our final minimal adequate model in m13.glmer.

In a next step, we create an overview of model comparisons which serves as a summary for the model fitting process and provides AIC, BIC, and \(\chi\)2 values.

source("https://slcladal.github.io/rscripts/ModelFittingSummarySWSU.r") 
# comparisons of glmer objects
m1.m0 <- anova(m1.glmer, m0.glmer, test = "Chi") 
m2.m1 <- anova(m2.glmer, m1.glmer, test = "Chi")   
m3.m1 <- anova(m3.glmer, m1.glmer, test = "Chi")
m4.m3 <- anova(m4.glmer, m3.glmer, test = "Chi") 
m5.m4 <- anova(m5.glmer, m4.glmer, test = "Chi") 
m6.m4 <- anova(m6.glmer, m4.glmer, test = "Chi") 
m7.m6 <- anova(m7.glmer, m6.glmer, test = "Chi")
m8.m6 <- anova(m8.glmer, m6.glmer, test = "Chi") 
m9.m6 <- anova(m9.glmer, m6.glmer, test = "Chi") 
m10.m6 <- anova(m10.glmer, m6.glmer, test = "Chi") 
m11.m6 <- anova(m11.glmer, m6.glmer, test = "Chi") 
m12.m6 <- anova(m12.glmer, m6.glmer, test = "Chi")
m13.m6 <- anova(m13.glmer, m6.glmer, test = "Chi")
# create a list of the model comparisons
mdlcmp <- list(m1.m0, m2.m1, m3.m1, m4.m3, m5.m4, m6.m4, m7.m6, m8.m6, m9.m6, m10.m6, m11.m6, m12.m6, m13.m6)
# summary table for model fitting
mdlft <- mdl.fttng.swsu(mdlcmp)
mdlft <- mdlft[,-2]
# inspect data
datatable(mdlft, rownames = FALSE, filter = "none")

We now rename our final minimal adequate model, test whether it performs significantly better than the minimal base-line model, and print the regression summary.

mlr.glmer <- m13.glmer # rename final minimal adequate model
anova(mlr.glmer, m0.glmer, test = "Chi") # final model better than base-line model
## Data: mblrdata
## Models:
## m0.glmer: SUFlike ~ 1 + (1 | ID)
## mlr.glmer: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender + 
## mlr.glmer:     Priming:ConversationType + Gender:ConversationType + Priming:Gender:ConversationType
##           npar  AIC  BIC logLik deviance Chisq Df          Pr(>Chisq)    
## m0.glmer     2 1828 1840   -912     1824                                 
## mlr.glmer    9 1660 1710   -821     1642   183  7 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(mlr.glmer, corr = F) # inspect final minimal adequate model
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: binomial  ( logit )
## Formula: SUFlike ~ (1 | ID) + Priming + Gender + ConversationType + Priming:Gender +  
##     Priming:ConversationType + Gender:ConversationType + Priming:Gender:ConversationType
##    Data: mblrdata
##      AIC      BIC   logLik deviance df.resid 
##   1659.5   1709.9   -820.8   1641.5     1991 
## Random effects:
##  Groups Name        Std.Dev.
##  ID     (Intercept) 0.315   
## Number of obs: 2000, groups:  ID, 208
## Fixed Effects:
##                                         (Intercept)                                         PrimingPrime  
##                                              -0.807                                                0.371  
##                                         GenderWomen                           ConversationTypeSameGender  
##                                              -1.003                                               -1.809  
##                            PrimingPrime:GenderWomen              PrimingPrime:ConversationTypeSameGender  
##                                               1.680                                                2.486  
##              GenderWomen:ConversationTypeSameGender  PrimingPrime:GenderWomen:ConversationTypeSameGender  
##                                               1.342                                               -2.421

To extract the effect sizes of the significant fixed effects, we compare the model with that effect to a model without that effect so that we can ascertain how much variance that effect explains. In our case, this is purely to show how to do this because main effects are superseded by interactions in which they are involved and should therefore not be interpreted (Field, Miles, and Field 2012, 622).

anova(m1.glmer, m0.glmer, test = "Chi") 
## Data: mblrdata
## Models:
## m0.glmer: SUFlike ~ 1 + (1 | ID)
## m1.glmer: SUFlike ~ (1 | ID) + Priming
##          npar  AIC  BIC logLik deviance Chisq Df          Pr(>Chisq)    
## m0.glmer    2 1828 1840   -912     1824                                 
## m1.glmer    3 1703 1720   -848     1697   128  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualizing Effects

As we will see the effects in the final summary, we visualize the effects here by showing the probability of discourse like based on the predicted values.

# extract predicted values
mblrdata$Predicted <- predict(m13.glmer, mblrdata, type = "response")
# plot
ggplot(mblrdata, aes(Gender, Predicted, color = Priming)) +
  facet_wrap(~ConversationType) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
    ylim(0, .75) +
  labs(x = "", y = "Predicted Probabilty of discourse like") +
  scale_color_manual(values = c("gray20", "gray70"))

A proper visualization of the marginal effects can be extracted using the sjPlot package.

plot_model(m13.glmer, type = "pred", terms = c("ConversationType", "Priming", "Gender"))

We can see that discourse like is more likely to surface in primed contexts but that in contrast to women and men in same-gender conversations as well as women in mixed-gender conversations, priming appears to affect the use of discourse like by men in mixed-gender conversations only very little.

Extracting Model Fit Parameters

We now extract model fit parameters (Baayen 2008, 281).

probs = 1/(1+exp(-fitted(mlr.glmer)))
probs = binomial()$linkinv(fitted(mlr.glmer))
somers2(probs, as.numeric(mblrdata$SUFlike))
##         C       Dxy         n   Missing 
##    0.7646    0.5293 2000.0000    0.0000

The model fit parameters indicate a suboptimal fit. Both the C-value and Somers’s Dxy show poor fit between predicted and observed occurrences of SUFlike. If the C-value is 0.5, the predictions are random, while the predictions are perfect if the C-value is 1. C-values above 0.8 indicate real predictive capacity (Baayen 2008, 204). Somers’ Dxy is a value that represents a rank correlation between predicted probabilities and observed responses. Somers’ Dxy values range between 0, which indicates complete randomness, and 1, which indicates perfect prediction (Baayen 2008, 204). This a value of .2646 suggests that the model performs better than chance but not substantially so. We will now perform the model diagnostics.

Model Diagnostics

We begin the model diagnostics by generating a diagnostic that plots the fitted or predicted values against the residuals.

plot(mlr.glmer, pch = 20, col = "black", lty = "dotted")

As a final step, we summarize our findings in tabulated form.

  SUFlike
Predictors Odds Ratios CI p
(Intercept) 0.45 0.32 – 0.61 <0.001
Priming [Prime] 1.45 0.61 – 3.47 0.405
Gender [Women] 0.37 0.24 – 0.57 <0.001
ConversationType
[SameGender]
0.16 0.05 – 0.59 0.006
Priming [Prime] * Gender
[Women]
5.37 1.78 – 16.16 0.003
Priming [Prime] *
ConversationType
[SameGender]
12.02 2.15 – 66.99 0.005
Gender [Women] *
ConversationType
[SameGender]
3.83 1.02 – 14.39 0.047
(Priming [Prime] * Gender
[Women]) *
ConversationType
[SameGender]
0.09 0.01 – 0.59 0.012
Random Effects
σ2 3.29
τ00 ID 0.10
ICC 0.03
N ID 208
Observations 2000
Marginal R2 / Conditional R2 0.144 / 0.169

A mixed-effect binomial logistic regression model which contained speaker as random effect was fit to the data in a step-wise-step up procedure. The final minimal adequate model performed significantly better than an intercept-only base line model (\(\chi\)2(8): 842.06, p <.0001) but has a suboptimal fit (C: 0.765, Somers’ Dxy: .529). The final minimal adequate model reported a significant interaction between the gender of speakers, the type of conversation and priming (\(\chi\)2(3): 9.66, p = .022). The model showed that discourse like is more likely to surface in primed contexts but that in contrast to women and men in same-gender conversations as well as women in mixed-gender conversations, priming appears to affect the use of discourse like by men in mixed-gender conversations only to a marginal extent.

2.3 Mixed-Effects (Quasi-)Poisson and Negative-Binomial Regression

Like fixed-effects Poisson models, mixed-effects Poisson models take counts as dependent variables. The data for this analysis was collected on three separate evenings (Trial). The number of the filler “uhm” (UHM) was counted in two-minute conversations that were either in English, German, Russian, or Mandarin (Language). In addition, the number of shots that speakers drank before they talked was recorded (Shots).

# load data
countdata <- read.table("https://slcladal.github.io/data/countdata.txt", 
                       comment.char = "",quote = "", sep = "\t", header = T) 
# inspect data
str(countdata)
## 'data.frame':    500 obs. of  6 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Trial   : int  3 3 3 1 1 3 1 3 3 2 ...
##  $ Language: chr  "Russian" "Russian" "German" "German" ...
##  $ Gender  : chr  "Man" "Man" "Man" "Man" ...
##  $ UHM     : int  0 0 0 0 2 1 1 0 0 0 ...
##  $ Shots   : int  0 0 5 3 6 5 1 4 0 2 ...

Since the data contains character variables, we need to factorize the data before we can analyse it further and we also remove the ID column.

# factorize variables
countdata <- countdata %>%
  dplyr::select(-ID)
clfct <- c("Trial", "Language", "Gender")
countdata[clfct] <- lapply(countdata[clfct], factor)
 # inspect data
str(countdata); head(countdata)
## 'data.frame':    500 obs. of  5 variables:
##  $ Trial   : Factor w/ 3 levels "1","2","3": 3 3 3 1 1 3 1 3 3 2 ...
##  $ Language: Factor w/ 4 levels "English","German",..: 4 4 2 2 2 2 3 2 4 2 ...
##  $ Gender  : Factor w/ 2 levels "Man","Woman": 1 1 1 1 2 1 1 2 2 1 ...
##  $ UHM     : int  0 0 0 0 2 1 1 0 0 0 ...
##  $ Shots   : int  0 0 5 3 6 5 1 4 0 2 ...
##   Trial Language Gender UHM Shots
## 1     3  Russian    Man   0     0
## 2     3  Russian    Man   0     0
## 3     3   German    Man   0     5
## 4     1   German    Man   0     3
## 5     1   German  Woman   2     6
## 6     3   German    Man   1     5

After the data is factorized, we can visualize the data.

countdata %>%
  # prepare data
  dplyr::select(Language, Shots) %>%
  dplyr::group_by(Language) %>%
  dplyr::mutate(Mean = round(mean(Shots), 1)) %>%
  dplyr::mutate(SD = round(sd(Shots), 1)) %>%
  # start plot
  ggplot(aes(Language, Shots, color = Language, fill = Language)) +
  geom_violin(trim=FALSE, color = "gray20")+ 
  geom_boxplot(width=0.1, fill="white", color = "gray20") +
  geom_text(aes(y=-4,label=paste("mean: ", Mean, sep = "")), size = 3, color = "black") +
  geom_text(aes(y=-5,label=paste("SD: ", SD, sep = "")), size = 3, color = "black") +
  scale_fill_manual(values=rep("grey90",4)) + 
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position="none", legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) + 
  ylim(-5, 15) +
  labs(x = "Language", y = "Shots")
## Warning: Removed 2 rows containing missing values (geom_violin).

The violin plots show that the English speakers drank more shots than speakers of other languages with Mandarin speakers drinking the fewest shots.

In the present case, we will a Boruta variable selection procedure to streamline the model fitting process. Thus, before fitting the model, we will test which variables have any kind of relationship with the dependent variable and therefore deserve to be evaluated in the regression modeling. As this is just an example, we will only consider variables which are deemed important and disregard both unimportant and tentative variables. We start the Boruta analysis by setting a seed and running an initial Boruta analysis.

# perform variable selection
set.seed(20191220)
boruta <- Boruta(UHM ~.,data=countdata)
print(boruta)
## Boruta performed 99 iterations in 6.193 secs.
##  1 attributes confirmed important: Shots;
##  1 attributes confirmed unimportant: Gender;
##  2 tentative attributes left: Language, Trial;

As only Shots is confirmed as important, we will only check for the effect of Shots and include Language as a random effect in the regression modeling. Including Language as a random effect is probably not justified statistically (given that the Boruta analysis showed that it only has a tentative effect) but for theoretical reasons as the speakers are nested into Languages. Before we start with the modeling, however, we proceed by checking if the data does indeed approximate a Poisson distribution.

# output the results
gf = goodfit(countdata$UHM,type= "poisson", method= "ML")
plot(gf,main="Count data vs Poisson distribution")

The data does not perfectly match a distribution that would be expected if the data approximated a Poisson distribution. We will use a goodness-of-fit test to check if the data does indeed diverge significantly from being Poisson distributed. If the p-values of the goodness-of-fit test is smaller than .05, then the distribution of the data differs significantly from a Poisson distribution and, given the visualization is likely over-dispersed.

In case of overdispersion, we may have to use a quasi-Poisson or, even better, a negative binomial model but we will, for now continue with the Poisson model and perform diagnostics later to check if we have to switch to a more robust method. One effect of overdispersion is that the standard errors of a model are biased and quasi-Poisson models scale the standard errors to compensate bias. However, Zuur, Hilbe, and Ieno (2013) suggest to use negative-binomial model instead. This is so because the scaling of the standard errors performed by quasi-Poisson models only affects the significance of coefficients (the p-values) but it does not affect the coefficients which, however, may be affected themselves by overdispersion. Thus, the coefficients of Poisson as well as quasi-Poisson models (which are identical) may be unreliable when dealing with overdispersion. Negative binomial models, in contrast, include an additional dispersion or heterogeneity parameter which accommodates overdispersion better than merely scaling the standard errors (see Zuur, Hilbe, and Ieno 2013, 21).

summary(gf)
## 
##   Goodness-of-fit test for poisson distribution
## 
##                    X^2 df                             P(> X^2)
## Likelihood Ratio 153.4  5 0.0000000000000000000000000000002493

The p-value is indeed smaller than .05 which means that we should indeed use a negative-binomial model rather than a Poisson model. We will ignore this, for now, and proceed to fit a Poisson mixed-effects model and check what happens if a Poisson model is fit to over-dispersed data.

Mixed-Effects Poisson Regression

In a first step, we create mixed-effect intercept-only baseline models and then test if including “Shots” significantly improves model fit and, thus, has a significant impact on the number of uhms.

# base-line mixed-model
m0.glmer = glmer(UHM ~ 1 + (1 | Language), data = countdata, family = poisson,
                 control=glmerControl(optimizer="bobyqa"))
# add Shots
m1.glmer <- update(m0.glmer, .~.+ Shots)
## boundary (singular) fit: see ?isSingular
Anova(m1.glmer, test = "Chi")           
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: UHM
##       Chisq Df          Pr(>Chisq)    
## Shots   321  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA confirms that Shots have a significant impact on the number of instances of uhm. However, we get the warning that the fitted mixed model is (almost / near) singular. In such cases, the model should not be reported. As this is only an example, we will continue by having a look at the model summary.

summary(m1.glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: poisson  ( log )
## Formula: UHM ~ (1 | Language) + Shots
##    Data: countdata
## Control: glmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   1041.8   1054.5   -517.9   1035.8      497 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.510 -0.666 -0.593  0.586  4.339 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Language (Intercept) 0        0       
## Number of obs: 500, groups:  Language, 4
## 
## Fixed effects:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  -1.2789     0.0893   -14.3 <0.0000000000000002 ***
## Shots         0.2336     0.0130    17.9 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Shots -0.806
## optimizer (bobyqa) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular

The model summary confirms that the number of shots does have a significantly positive effect on the number of occurrences of uhm. Furthermore, the scaled residuals are distributed very unevenly which suggests overdispersion. Including Language as a random effect is not justified given that they have 0 variance and a standard deviation of 0 (which means that Language does not account for or explain any additional variance).

We now check if the model suffers from overdispersion following (Zuur, Hilbe, and Ieno (2013) 138).

# extract pearson residuals
PearsonResiduals <- resid(m1.glmer, type = "pearson")
# extract number of cases in model
Cases <- nrow(countdata)
# extract number of predictors (plus intercept)
NumberOfPredictors <- length(fixef(m1.glmer)) +1
# calculate overdispersion
Overdispersion <- sum(PearsonResiduals^2) / (Cases-NumberOfPredictors)
# inspect overdispersion
Overdispersion
## [1] 1.169

The data is slightly over-dispersed. It would also be advisable to plot the Cook’s distance (which should not show data points with values > 1). If there are data points with high Cook’s D values, we could exclude them which would, very likely reduce the overdispersion (see Zuur, Hilbe, and Ieno 2013, 22). We ignore this, for now, and use diagnostic plots to check if the plots indicate problems.

diag_data <- data.frame(PearsonResiduals, fitted(m1.glmer))
colnames(diag_data) <- c("Pearson", "Fitted")
p9 <- ggplot(diag_data, aes(x = Fitted, y = Pearson)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted")
p10 <- ggplot(countdata, aes(x = Shots, y = diag_data$Pearson)) +
  geom_point()  +
  geom_hline(yintercept = 0, linetype = "dotted") +
  labs(y = "Pearson")
p11 <- ggplot(countdata, aes(x = Language, y = diag_data$Pearson)) +
  geom_boxplot() +
  labs(y = "Pearson") + 
  theme(axis.text.x = element_text(angle=90))
grid.arrange(p9, p10, p11, nrow = 1)

The diagnostic plots show problems as the dots in the first two plots are not random but show a pattern in the lower left corner. In addition, the variance of English (left boxplot) is notable larger than the variance of Russian (right boxplot). As a final step, we plot the predicted vales of the model to check if the predictions make sense.

plot_model(m1.glmer, type = "pred", terms = c("Shots"))

The model predicts that the instances of uhm increase with the number of shots. Note that the increase is not homogenous as the y-axis labels indicate! We now compare the predicted number of uhm with the actually observed instances of uhm to check if the results of the model make sense.

library(tidyr)
countdata %>%
  mutate(Predicted = predict(m1.glmer, type = "response")) %>%
  dplyr::rename(Observed = UHM) %>%
  tidyr::gather(Type, Frequency, c(Observed, Predicted)) %>%
  dplyr::mutate(Shots = factor(Shots),
                Type = factor(Type)) %>%
  dplyr::group_by(Shots, Type) %>%
  dplyr::summarize(Frequency = mean(Frequency)) %>%
  ggplot(aes(Shots, Frequency, group = Type, color = Type, linetype = Type)) +
  geom_line(size = 1.2) +
  scale_color_manual(values = c("orange", "lightblue")) 

The comparison between the observed and the predicted uses of uhm becomes somewhat volatile and shows fluctuations after eight shots. We will now summarize the results as if the violations (overdispersion measure > 1 and excessive multicollinearity (singular fit)) had NOT occurred(!) - again: this is only because we are practicing here - this would be absolutely unacceptable in a proper write-up of an analysis!

The summary of the model can be extracted using the tab_model function from the sjPlot package.

tab_model(m1.glmer)
  UHM
Predictors Incidence Rate Ratios CI p
(Intercept) 0.28 0.23 – 0.33 <0.001
Shots 1.26 1.23 – 1.30 <0.001
Random Effects
σ2 0.89
τ00 Language 0.00
N Language 4
Observations 500
Marginal R2 / Conditional R2 0.288 / NA

NOTE

The R^2^ values of the summary table are incorrect (as indicated by the missing conditional R^2^ value). The correct R^2^ values can be extracted using the `r.squaredGLMM` function from the `MuMIn` package. The respective call for the model is:
library(MuMIn)
r.squaredGLMM(m1.glmer)
##              R2m    R2c
## delta     0.2089 0.2089
## lognormal 0.2950 0.2950
## trigamma  0.1204 0.1204

Also note that our model suffers from a serious problem (near singular fit). If this were not just an example, you should not(!) report this model!

Mixed-Effects Quasi-Possion Regression

The Quasi-Poisson Regression is a generalization of the Poisson regression and is used when modeling an overdispersed count variable. Poisson models are based on the Poisson distribution which is defined as a distribution where the variance is equal to the mean (which is very restrictive and not often the case). Quasi-Poisson models scale the standard errors which has a positive effect when dealing with overdispersed data.

Therefore, when the variance is greater than the mean, a Quasi-Poisson model, which assumes that the variance is a linear function of the mean, is more appropriate as it handles over-dispersed data better than normal Poisson-models.

We begin the model fitting process by creating a mixed- and a fixed-effects intercept-only base-line model. Unfortunately, there is not yet a procedure in place for quasi-Poisson models to test if the inclusion of random effects is justified. However, here the Boruta also provides valuable information: Language was only considered tentative but not important which suggests that it will not explain variance which means that including Language as a random effect may not be justified. This would require further inspection. Because we are only dealing with an example here, we ignore this fact (which you should not do in proper analyses) and continue right away with adding shots.

# base-line mixed-model
m0.glmer = glmmPQL(UHM ~ 1, random = ~ 1 | Language, data = countdata, 
                   family = quasipoisson(link='log'))
## iteration 1
## iteration 2
## iteration 3
## iteration 4
# add Shots
m1.glmer <- update(m0.glmer, .~.+ Shots)
## iteration 1
Anova(m1.glmer, test = "Chi")           # SIG! (p<0.0000000000000002 ***)
## Analysis of Deviance Table (Type II tests)
## 
## Response: zz
##       Chisq Df          Pr(>Chisq)    
## Shots   276  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA confirms that Shots have a significant impact on the number of instances of uhm. We will now have a look at the model summary.

summary(m1.glmer)
## Linear mixed-effects model fit by maximum likelihood
##  Data: countdata 
##   AIC BIC logLik
##    NA  NA     NA
## 
## Random effects:
##  Formula: ~1 | Language
##         (Intercept) Residual
## StdDev:  0.00004081    1.078
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~invwt 
## Fixed effects: UHM ~ Shots 
##               Value Std.Error  DF t-value p-value
## (Intercept) -1.2789   0.09649 495  -13.26       0
## Shots        0.2336   0.01408 495   16.59       0
##  Correlation: 
##       (Intr)
## Shots -0.806
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -1.4004 -0.6182 -0.5501  0.5437  4.0248 
## 
## Number of Observations: 500
## Number of Groups: 4

The model summary does not provide AIC, BIC or logged likelihood values. The coefficient for “Shots” is highly significant (p <.001) and the data is notably over-dispersed (the Standardized Within-Group Residuals deviate substantially from a normal distribution with higher values having a thick tail). Also, in contrast to the Poisson model, Language does explain at least a minimal share of the variance now as the mean and standard deviation are no longer 0. Note also, that the coefficients are identical to the Poisson coefficients but the standard errors and p-values differ (the model provides t- rather than z-values).

In a next step, we will calculate the odds ratios of the coefficient (as we only have one). We will use the coefficients from the fixed-effects model as the coefficients for mixed- and fixed-effects models are identical (the random effect structure only affects the standard error and p-values but not the coefficients; you can check by uncommenting the summary command).

m1.glm = glm(UHM ~ Shots, data = countdata, family = quasipoisson(link='log'))
exp(coef(m1.glm))
## (Intercept)       Shots 
##      0.2783      1.2632

The standardized or \(\beta\)-coefficient tells us that the likelihood of uhm increases by 1.26 (or 26.32 percent) with each additional shot.

Before inspecting the relationship between Shots and uhm, we will check if the overdispersion was reduced.

# extract pearson residuals
PearsonResiduals <- resid(m1.glmer, type = "pearson")
# extract number of cases in model
Cases <- nrow(countdata)
# extract number of predictors (plus intercept)
NumberOfPredictors <- length(fixef(m1.glmer)) +1
# calculate overdispersion
Overdispersion <- sum(PearsonResiduals^2) / (Cases-NumberOfPredictors)
# inspect overdispersion
Overdispersion
## [1] 1.006

The overdispersion has indeed decreased and is not so close to 1 that overdispersion is no longer an issue.

We continue to diagnose the model by plotting the Pearson’s residuals against fitted values. This diagnostic plot should not show a funnel-like structure or patterning as we observed in the case of the Poisson model.

# diagnostic plot
plot(m1.glmer, pch = 20, col = "black", lty= "dotted", ylab = "Pearson's residuals")

Indeed, the plot exhibits a (slight) funnel shape (but not drastically so) and thus indicates heteroscedasticity. However, the patterning that we observed with the Poisson model has disappeared. We continue by plotting the random effect adjustments.

# generate diagnostic plots
plot(m1.glmer, Language ~ resid(.), abline = 0, fill = "gray70") 

The adjustments by “Language” are marginal (which was somewhat expected given that Language was only deemed tentative), which shows that there is very little variation between the languages and that we have no statistical reason to include Language as a random effect.

In a final step, we plot the fixed-effect of “Shots” using the “predictorEffects” function from the “effects” package.

plot_model(m1.glmer, type = "pred", terms = c("Shots"))

The effects plot shows that the number of uhms increases exponentially with the number of shots a speaker has had. We now compare the predicted number of uhm with the actually observed instances of uhm to check if the results of the model make sense.

library(tidyr)
countdata %>%
  mutate(Predicted = predict(m1.glmer, type = "response")) %>%
  dplyr::rename(Observed = UHM) %>%
  tidyr::gather(Type, Frequency, c(Observed, Predicted)) %>%
  dplyr::mutate(Shots = factor(Shots),
                Type = factor(Type)) %>%
  dplyr::group_by(Shots, Type) %>%
  dplyr::summarize(Frequency = mean(Frequency)) %>%
  ggplot(aes(Shots, Frequency, group = Type, color = Type, linetype = Type)) +
  geom_line(size = 1.2) +
  scale_color_manual(values = c("orange", "lightblue")) 

Given that the overdispersion measure of this Quasi-Poisson model is close to 1, that the model did not suffer from excessive multicollinearity (singular fit), and because this model shows improvements compared to the Poisson model with respect to the model diagnostics (some adjustments by Language and less patterning in the diagnostic plots), we would choose this quasi-Poisson model over the Poisson model.

Finally, we extract the summary table of this model.

tab_model(m1.glmer)
  UHM
Predictors Incidence Rate Ratios CI p
(Intercept) 0.28 0.23 – 0.34 <0.001
Shots 1.26 1.23 – 1.30 <0.001
N Language 4
Observations 500

Note: The summary tables does not provide R2 values. The R2 values can be extracted using the r.squaredGLMM function from the MuMIn package. The respective call for the model is:

library(MuMIn)
r.squaredGLMM(m1.glmer)
##              R2m    R2c
## delta     0.2636 0.2636
## lognormal 0.3408 0.3408
## trigamma  0.1784 0.1784

Mixed-Effects Negative Binomial Regression

Negative binomial regression models are a generalization of Poisson regression which loosens the restrictive assumption that the variance is equal to the mean made by the Poisson model. This is a major advantage as the most common issue that one faces with Poisson regressions is that the data deviate too substantially from the assumed Poisson distribution.

To implement a Negative-Binomial Mixed-Effects Regression, we first create a mixed-effects intercept-only baseline model and then test if including Shots significantly improves model fit and, thus, has a significant impact on the number of uhms.

# base-line mixed-model
m0.glmer = glmer.nb(UHM ~ 1 + (1 | Language), data = countdata)
# add Shots
m1.glmer <- update(m0.glmer, .~.+ Shots)
## boundary (singular) fit: see ?isSingular
Anova(m1.glmer, test = "Chi")           
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: UHM
##       Chisq Df          Pr(>Chisq)    
## Shots    83  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The negative-binomial model also reports a significant impact of shots on the number of uhms. We will now inspect the summary.

# inspect model
summary(m1.glmer)           
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: Negative Binomial(0.7478)  ( log )
## Formula: UHM ~ (1 | Language) + Shots
##    Data: countdata
## 
##      AIC      BIC   logLik deviance df.resid 
##   1051.6   1068.5   -521.8   1043.6      496 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -0.770 -0.514 -0.468  0.476  2.781 
## 
## Random effects:
##  Groups   Name        Variance               Std.Dev.     
##  Language (Intercept) 0.00000000000000000425 0.00000000206
## Number of obs: 500, groups:  Language, 4
## 
## Fixed effects:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  -1.4495     0.1383  -10.48 <0.0000000000000002 ***
## Shots         0.2782     0.0305    9.11 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Shots -0.816
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular

In a next step, we calculate the overdispersion.

# extract pearson residuals
PearsonResiduals <- resid(m1.glmer, type = "pearson")
# extract number of betas + predictors + sigma
NumberOfPredictors <- 2+1+1
# extract number of cases in model
Cases <- nrow(countdata)
# caluculate overdispersion parameter
Overdispersion <- sum(PearsonResiduals^2) / (Cases / NumberOfPredictors)# show overdispersion parameter
Overdispersion
## [1] 2.469

The overdispersion has increased which is rather suboptimal. In this case, we would report the Quasi-Poisson Regression rather than the Negative Binomial Model (which is rather rare as Negative Binomial Models typically perform better than (Quasi-)Poisson models. However, this tutorial focuses merely on how to implement a Negative Binomial Mixed-Effects Regression and we thus continue with generating diagnostic plots to check for problems.

diag_data <- data.frame(PearsonResiduals, fitted(m1.glmer))
colnames(diag_data) <- c("Pearson", "Fitted")
p9 <- ggplot(diag_data, aes(x = Fitted, y = Pearson)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted")
p10 <- ggplot(countdata, aes(x = Shots, y = diag_data$Pearson)) +
  geom_point()  +
  geom_hline(yintercept = 0, linetype = "dotted") +
  labs(y = "Pearson")
p11 <- ggplot(countdata, aes(x = Language, y = diag_data$Pearson)) +
  geom_boxplot() +
  labs(y = "Pearson") + 
  theme(axis.text.x = element_text(angle=90))
grid.arrange(p9, p10, p11, nrow = 1)

The diagnostics show patterning similar to the one we saw with the Poisson model which suggest that the negative binomial model is also not an optimal model for our data. We continue by plotting the predicted values and, subsequently, summarize the analysis.

plot_model(m1.glmer, type = "pred", terms = c("Shots"))

The effect plot shows that the predicted number of shots increases exponentially with each shot. We now compare the predicted number of uhm with the actually observed instances of uhm to check if the results of the model make sense.

library(tidyr)
countdata %>%
  mutate(Predicted = predict(m1.glmer, type = "response")) %>%
  dplyr::rename(Observed = UHM) %>%
  tidyr::gather(Type, Frequency, c(Observed, Predicted)) %>%
  dplyr::mutate(Shots = factor(Shots),
                Type = factor(Type)) %>%
  dplyr::group_by(Shots, Type) %>%
  dplyr::summarize(Frequency = mean(Frequency)) %>%
  ggplot(aes(Shots, Frequency, group = Type, color = Type, linetype = Type)) +
  geom_line(size = 1.2) +
  scale_color_manual(values = c("orange", "lightblue")) 

The comparison between the observed and the predicted uses of uhm becomes somewhat volatile and shows fluctuations after eight shots. We will now summarize the results as if the violations had NOT occurred(!) - again: this is only because we are practicing here - this would be absolutely unacceptable in a proper write-up of an analysis!

And, we extract the summary table of this model.

tab_model(m1.glmer)
  UHM
Predictors Incidence Rate Ratios CI p
(Intercept) 0.23 0.18 – 0.31 <0.001
Shots 1.32 1.24 – 1.40 <0.001
Random Effects
σ2 1.33
τ00 Language 0.00
N Language 4
Observations 500
Marginal R2 / Conditional R2 0.277 / NA

Note: The R2 values of the summary table are incorrect (as indicated by the missing conditional R2 value). The correct R2 values can be extracted using the r.squaredGLMM function from the MuMIn package. The respective call for the model is:

library(MuMIn)
r.squaredGLMM(m1.glmer)
## Warning: The null model is correct only if all variables used by the original model remain unchanged.
##               R2m     R2c
## delta     0.15514 0.15514
## lognormal 0.27726 0.27726
## trigamma  0.05492 0.05492

A mixed-effect negative binomial regression model which contained the language in which the conversation took place as random effect was fit to the data. Prior to the regression modeling, a Boruta analysis was applied to determine whether any of the predictors had a meaningful relationship with the dependent variable (instances of uhm). Since the Boruta analysis indicated that only the number of shots a speaker had was important, only “Shots” was tested during model fitting. The final minimal adequate model showed that the number of uhm as fillers increases significantly, and near-linearly with the number of shots speakers had (\(\chi\)2(1):83.0, p <.0001, \(\beta\): 0.2782). An inspection of the random effect structure conveyed that there was almost no variability between languages and language did not contribute meaningfully to the model fit.

Citation & Session Info

Schweinberger, Martin. 2020. Fixed- and Mixed-Effects Regression Models in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/regression.html (Version 2020.12.03).

@manual{schweinberger2020regression,
  author = {Schweinberger, Martin},
  title = {Fixed- and Mixed-Effects Regression Models in R},
  note = {https://slcladal.github.io/regression.html},
  year = {2020},
  organization = "The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/12/03}
}
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] visreg_2.7.0          sjPlot_2.8.6          sfsmisc_1.1-7         sandwich_3.0-0        rms_6.0-1             SparseM_1.78         
##  [7] reshape2_1.4.4        nlme_3.1-149          MuMIn_1.43.17         msm_1.6.8             mlogit_1.1-1          dfidx_0.0-3          
## [13] lme4_1.1-25           Hmisc_4.4-1           Formula_1.2-4         survival_3.2-7        ggeffects_0.16.0      foreign_0.8-80       
## [19] effects_4.2-0         caret_6.0-86          Boruta_7.0.0          hashr_0.1.3           stringdist_0.9.6.3    koRpus.lang.de_0.1-1 
## [25] coop_0.6-2            hunspell_3.0          koRpus.lang.en_0.1-4  koRpus_0.13-3         sylly_0.1-6           textdata_0.4.1       
## [31] tokenizers_0.2.1      readxl_1.3.1          cowplot_1.1.0         magick_2.5.2          officer_0.3.15        gutenbergr_0.2.0     
## [37] flextable_0.5.11      wordcloud_2.6         RColorBrewer_1.1-2    SnowballC_0.7.0       scales_1.1.1          likert_1.3.5         
## [43] xtable_1.8-4          Rmisc_1.5             plyr_1.8.6            lattice_0.20-41       psych_2.0.9           DescTools_0.99.38    
## [49] pdftools_2.3.1        collostructions_0.1.2 igraph_1.2.6          GGally_2.0.0          network_1.16.1        ggdendro_0.1.22      
## [55] slam_0.1-47           Matrix_1.2-18         tm_0.7-7              NLP_0.2-1             tidytext_0.2.6        quanteda_2.1.2       
## [61] gplots_3.1.0          FactoMineR_2.3        exact2x2_1.6.5        exactci_1.3-3         ssanv_1.1             vcd_1.4-8            
## [67] ape_5.4-1             pvclust_2.2-0         NbClust_3.0           seriation_1.2-9       factoextra_1.0.7      cluster_2.1.0        
## [73] QuantPsyc_1.5         MASS_7.3-53           boot_1.3-25           car_3.0-10            carData_3.0-4         forcats_0.5.0        
## [79] stringr_1.4.0         dplyr_1.0.2           purrr_0.3.4           readr_1.4.0           tidyr_1.1.2           tibble_3.0.4         
## [85] tidyverse_1.3.0       cfa_0.10-0            gridExtra_2.3         fGarch_3042.83.2      fBasics_3042.89.1     timeSeries_3062.100  
## [91] timeDate_3043.102     DT_0.16               kableExtra_1.3.1      ggplot2_3.3.2         knitr_1.30           
## 
## loaded via a namespace (and not attached):
##   [1] statnet.common_4.4.1 class_7.3-17         foreach_1.5.1        lmtest_0.9-38        crayon_1.3.4         rbibutils_1.3       
##   [7] backports_1.1.10     reprex_0.3.0         rlang_0.4.8          performance_0.5.1    nloptr_1.2.2.2       glue_1.4.2          
##  [13] parallel_4.0.3       isoband_0.2.2        haven_2.3.1          tidyselect_1.1.0     usethis_1.6.3        rio_0.5.16          
##  [19] zoo_1.8-8            ggpubr_0.4.0         sjmisc_2.8.5         MatrixModels_0.4-1   magrittr_1.5         evaluate_0.14       
##  [25] Rdpack_2.0           gdtools_0.2.2        cli_2.1.0            rstudioapi_0.11      rpart_4.1-15         fastmatch_1.1-0     
##  [31] sjlabelled_1.1.7     sylly.en_0.1-3       xfun_0.19            parameters_0.9.0     askpass_1.1          sylly.de_0.1-2      
##  [37] caTools_1.18.0       TSP_1.1-10           expm_0.999-5         quantreg_5.75        ggrepel_0.8.2        png_0.1-7           
##  [43] reshape_0.8.8        ipred_0.9-9          withr_2.3.0          rle_0.9.2            bitops_1.0-6         ranger_0.12.1       
##  [49] cellranger_1.1.0     e1071_1.7-4          pROC_1.16.2          survey_4.0           coda_0.19-4          pillar_1.4.6        
##  [55] RcppParallel_5.0.2   multcomp_1.4-14      fs_1.5.0             scatterplot3d_0.3-41 vctrs_0.3.4          ellipsis_0.3.1      
##  [61] generics_0.1.0       qpdf_1.1             lava_1.6.8.1         urltools_1.7.3       tools_4.0.3          munsell_0.5.0       
##  [67] emmeans_1.5.2-1      compiler_4.0.3       abind_1.4-5          prodlim_2019.11.13   utf8_1.1.4           recipes_0.1.14      
##  [73] jsonlite_1.7.1       gld_2.6.2            estimability_1.3     lazyeval_0.2.2       latticeExtra_0.6-29  effectsize_0.4.0    
##  [79] sna_2.6              checkmate_2.0.0      rmarkdown_2.5        openxlsx_4.2.3       statmod_1.4.35       webshot_0.5.2       
##  [85] yaml_2.2.1           systemfonts_0.3.2    htmltools_0.5.0      viridisLite_0.3.0    digest_0.6.27        assertthat_0.2.1    
##  [91] leaps_3.1            rappdirs_0.3.1       registry_0.5-1       bayestestR_0.7.5     Exact_2.1            data.table_1.13.2   
##  [97] splines_4.0.3        labeling_0.4.2       broom_0.7.2          hms_0.5.3           
##  [ reached getOption("max.print") -- omitted 56 entries ]

Main page


References

Achen, Christopher H. 1982. Interpreting and Using Regression. Vol. 29. Sage.

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. Hoboken, NJ: JohnWiley & Sons.

———. 2010. Analysis of Ordinal Categorical Data. Vol. 656. John Wiley & Sons.

Agresti, Alan, and Maria Kateri. 2011. Categorical Data Analysis. Springer.

Austin, Peter C., and George Leckie. 2018. “The Effect of Number of Clusters and Cluster Size on Statistical Power and Type I Error Rates When Testing Random Effects Variance Components in Multilevel Linear and Logistic Regression Models.” Journal of Statistical Computation and Simulation 88 (16): 3151–63.

Baayen, R Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University press.

Bell, Bethany A., John M. Ferron, and Jeffrey D. Kromrey. 2008. “Cluster Size in Multilevel Models: The Impact of Sparse Data Structures on Point and Interval Estimates in Two-Level Models.” In JSM Proceedings. Section on Survey Research Methods, 1122–9. American Statistical Association Alexandria, VA.

Booth GD, Schuster EG, Niccolucci MJ. 1994. “Identifying Proxy Sets in Multiple Linear Regression: An Aid to Better Coefficient Interpretation. Research Paper Int-470.”

Bortz, Jürgen. 2006. Statistik: Für Human-Und Sozialwissenschaftler. Springer-Verlag.

Bowerman, Bruce L, and Richard T O’Connell. 1990. Linear Statistical Models: An Applied Approach. Boston: PWS-Kent.

Clarke, Philippa. 2008. “When Can Group Level Clustering Be Ignored? Multilevel Models Versus Single-Level Models with Sparse Data.” Journal of Epidemiology & Community Health 62 (8): 752–58.

Clarke, Philippa, and Blair Wheaton. 2007. “Addressing Data Sparseness in Contextual Population Research: Using Cluster Analysis to Create Synthetic Neighborhoods.” Sociological Methods & Research 35 (3): 311–51.

Crawley, Michael J. 2005. Statistics: An Introduction Using R. 2005. Chichester, West Sussex: John Wiley & Sons.

———. 2012. The R Book. John Wiley & Sons.

Faraway, Julian J. 2002. Practical Regression and Anova Using R. University of Bath.

Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.

Green, Samuel B. 1991. “How Many Subjects Does It Take to Do a Regression Analysis.” Multivariate Behavioral Research 26 (3): 499–510.

Gries, Stefan Th. 2009. Statistics for Linguistics Using R: A Practical Introduction. Berlin & New York: Mouton de Gruyter.

Harrell Jr, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer.

Johnson, Daniel Ezra. 2009. “Getting Off the Goldvarb Standard: Introducing Rbrul for Mixed-Effects Variable Rule Analysis.” Language and Linguistics Compass 3 (1): 359–83.

Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.

Maas, Cora JM, and Joop J Hox. 2005. “Sufficient Sample Sizes for Multilevel Modeling.” Methodology 1 (3): 86–92.

Menard, Scott. 1995. Applied Logistic Regression Analysis: Sage University Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.

Myers, Raymond H. 1990. Classical and Modern Regression with Applications. Vol. 2. Duxbury Press Belmont, CA.

Neter, J., W. Wasserman, and MH Kutner. 1990. Applied Linear Statistical Models. Regression, Analysis of Variance, and Experimental Design. Irwin, Homewood, USA.

Pinheiro, J. C., and D. M. Bates. 2000. Mixed-Effects Models in S and S-Plus. Statistics and Computing. New York: Springer.

Rousseeuw, Peter J, and Annick M Leroy. 2005. Robust Regression and Outlier Detection. Vol. 589. John wiley & sons.

Szmrecsanyi, Benedikt. 2006. Morphosyntactic Persistence in Spoken English: A Corpus Study at the Intersection of Variationist Sociolinguistics, Psycholinguistics, and Discourse Analysis. Berlin & New York: Walter de Gruyter.

Twisk, J. W. R. 2006. Applied Multilevel Analysis: A Practical Guide. Cambridge: Cambridge University Press.

Wilcox, Rand R. 2009. Basic Statistics: Understanding Conventional Methods and Modern Insights. Oxford University Press.

Zuur, Alain F., Joseph M. Hilbe, and Elena N. Ieno. 2013. A Beginner’s Guide to Glm and Glmm with. R. Beginner’s Guide Series. Newburgh, United Kingdom: Highland Statistics Ltd.

Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14.

Zuur, Alain F., Elena N. ·Ieno, Anatoly A. · Walker Neil J. and· Saveliev, and Graham M. · Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer.